When Sample Selection Bias Precipitates Model Collapse
Summary
This paper demonstrates that data selection in low-resource verification regimes, where verifiers only have access to fragmented and biased slices of the target distribution, can paradoxically accelerate model collapse by pruning globally relevant tail modes. The authors provide theoretical proof and propose a collaborative proxy reference mechanism as a mitigation strategy.
Cached at: 06/15/26, 09:09 AM
# When Sample Selection Bias Precipitates Model Collapse
Source: [https://arxiv.org/html/2606.13732](https://arxiv.org/html/2606.13732)
Xianglong Du, Wei Liu, Jingqi Zhang, Peihua Mai, Meng Zhang, Yan Pang
###### Abstract
The proliferation of recursive training on synthetic data can alleviate data scarcity but risks model collapse, where repeated training erodes distributional tails and homogenizes outputs\. Data selection is widely viewed as a remedy, yet its reliability depends critically on the reference distribution used by the verifier\. We show that in low\-resource verification regimes, where each verifier observes only a small, fragmented, and biased slice of the target manifold, selection itself becomes biased\. This situation naturally arises in low\-resource data silos such as healthcare consortia or proprietary financial institutions, where raw data cannot be pooled and local references are inherently incomplete\. As a result, selection preferentially retains samples aligned with the local manifold while pruning globally relevant tail modes, turning from a safeguard against collapse into a mechanism that precipitates it\. We theoretically prove that such siloed selection accelerates collapse and induces power\-law diversity decay\. As an initial mitigation, we construct Wasserstein proxy references from multiple silos without sharing raw data\. Empirical results confirm that local\-reference selection fails on skewed distributions, whereas collaborative proxy references mitigate diversity degradation, suggesting that recursive synthetic\-data pipelines require particular caution when real\-data coverage is fragmented or scarce\.
Machine Learning, ICML
## 1 Introduction
The digital ecosystem is increasingly saturated with synthetic data that is indistinguishable from real data and is subsequently harvested to train future models. When generative models are recursively trained on the outputs of their predecessors, they undergo a degenerative process that is not merely a stagnation of quality but an active decay of statistical fidelity. This phenomenon, observed empirically (Bertrand et al., [2024](https://arxiv.org/html/2606.13732#bib.bib11)) and described theoretically (Kazdan et al., [2025](https://arxiv.org/html/2606.13732#bib.bib3)), is widely termed model collapse in prior work. (Footnote 1: While the literature on model collapse encompasses a variety of interpretations, we focus on the definition of Collapsing Variance (Shumailov et al., [2024](https://arxiv.org/html/2606.13732#bib.bib1); Kazdan et al., [2025](https://arxiv.org/html/2606.13732#bib.bib3); Schaeffer et al., [2025](https://arxiv.org/html/2606.13732#bib.bib2)). See [Appendix A](https://arxiv.org/html/2606.13732#A1) for a review of related work.) Qualitatively, it describes a self-consuming loop in which successive model generations trained on synthetic data progressively dissociate from the tails of the true underlying distribution and converge toward a distorted representation of reality as probability density functions narrow. Quantitatively, as the number of generations increases, the variance of the distribution shrinks and the Wasserstein distance between the synthetic and true distributions diverges (Kazdan et al., [2025](https://arxiv.org/html/2606.13732#bib.bib3); Shidani et al., [2025](https://arxiv.org/html/2606.13732#bib.bib15)).
The implications of these self-consuming loops are profound, ranging from loss of diversity (Zhou et al., [2025](https://arxiv.org/html/2606.13732#bib.bib16); Wyllie et al., [2024](https://arxiv.org/html/2606.13732#bib.bib13); Taori and Hashimoto, [2023](https://arxiv.org/html/2606.13732#bib.bib12)) to potential training instability and failures (Bertrand et al., [2024](https://arxiv.org/html/2606.13732#bib.bib11); Alemohammad et al., [2024](https://arxiv.org/html/2606.13732#bib.bib14)).
The current consensus in the research community underscores the importance of data selection (Feng et al., [2025](https://arxiv.org/html/2606.13732#bib.bib4); Dohmatob et al., [2025a](https://arxiv.org/html/2606.13732#bib.bib7); Fu et al., [2025](https://arxiv.org/html/2606.13732#bib.bib10); Yi et al., [2026](https://arxiv.org/html/2606.13732#bib.bib8)): the active filtering of synthetic data to eliminate low-quality samples. Intuitively, if a perfect verifier exists (modeled as an external oracle or high-quality filter that can distinguish synthetic data preserving the true distribution from synthetic data introducing noise), recursive training can be stabilized. Under such ideal conditions, performance could even surpass that of models trained solely on raw data (Shi et al., [2025](https://arxiv.org/html/2606.13732#bib.bib5)). Numerous data selection methods have therefore been developed to mitigate model collapse. For instance, in the domain of language models, Feng et al. ([2025](https://arxiv.org/html/2606.13732#bib.bib4)) use metrics such as the ROUGE score to quantify the alignment between generated outputs and ground-truth data, retaining the samples with the highest fidelity.
However, consider low-resource settings such as isolated institutions, including hospitals and banks, that operate under strict privacy regulations. In such data-scarce and siloed environments, these entities often resort to synthetic data for recursive self-improvement in order to sustain model performance (Huang et al., [2023](https://arxiv.org/html/2606.13732#bib.bib17)). Their verification mechanisms are consequently local by construction, operating only on partial and biased slices of the global distribution. When a generative model is vetted against such restricted criteria, the selected synthetic data fails to capture global diversity and instead reflects the verifier's limited local prior (Yi et al., [2026](https://arxiv.org/html/2606.13732#bib.bib8)). While prior theoretical works (Ferbach et al., [2024](https://arxiv.org/html/2606.13732#bib.bib9); Wei and Zhang, [2025](https://arxiv.org/html/2606.13732#bib.bib32)) have demonstrated related pathologies from other perspectives, such as the vanishing variance induced by preference-based data curation, a critical gap remains in understanding the structural impact of data silos:
(Q1) How does model collapse manifest when sample selection is confined to data-scarce environments?
In response to Q1, [Section 3](https://arxiv.org/html/2606.13732#S3) analyzes biased sample selection dynamics, focusing on data-scarce and low-resource environments that increasingly rely on synthetic augmentation. Combining theoretical intuition with empirical findings in [Subsection 3.1](https://arxiv.org/html/2606.13732#S3.SS1), we demonstrate that selection mechanisms governed by local priors act as biased filters that are blind to the global manifold, ultimately resulting in a loss of diversity. This process is similar to the diversity contraction observed in data curation based on human preferences (Ferbach et al., [2024](https://arxiv.org/html/2606.13732#bib.bib9); Wei and Zhang, [2025](https://arxiv.org/html/2606.13732#bib.bib32)), yet it is passively driven by inherent constraints. Far from ensuring stability, [Subsection 3.2](https://arxiv.org/html/2606.13732#S3.SS2) shows that insular selection prunes the diversity essential for recursive training, steering the model toward collapsed data diversity at an asymptotic power-law rate. Although [Subsection 3.3](https://arxiv.org/html/2606.13732#S3.SS3) shows that equipping the verifier with global ground truth offers a theoretical remedy, intrinsic data scarcity and low-resource constraints preclude this solution, presenting a dilemma:
(Q2) How can we verify synthetic data against a global reference distribution that no single entity possesses, while operating under data scarcity?
In response to Q2, we bridge this gap by proposing a collaborative framework inspired by recent work (Li et al., [2024](https://arxiv.org/html/2606.13732#bib.bib31)). Although selection bias is well documented (Ferbach et al., [2024](https://arxiv.org/html/2606.13732#bib.bib9)), effective countermeasures for data silos are notably absent. Using the properties of Wasserstein geometry in [Subsection 4.2](https://arxiv.org/html/2606.13732#S4.SS2), we coordinate multiple data silos without direct data exchange to compute proxies: geodesic interpolations in [Subsection 4.3](https://arxiv.org/html/2606.13732#S4.SS3) or Wasserstein barycenters in [Subsection 4.4](https://arxiv.org/html/2606.13732#S4.SS4). These proxies serve as a collective reference, enabling multiple parties to score synthetic data rather than relying on a single biased silo. [Section 6](https://arxiv.org/html/2606.13732#S6) demonstrates a significant reduction in model collapse within low-resource communities, especially those characterized by data silos.
## 2 Preliminaries
##### Self\-Consuming Training Loops\.
Following standard formulations (Alemohammad et al., [2024](https://arxiv.org/html/2606.13732#bib.bib14); Shumailov et al., [2023](https://arxiv.org/html/2606.13732#bib.bib19)), we define a self-consuming loop as an iterative process in which a model $\mathcal{M}_{t}$ is trained on a dataset $\mathbf{X}_{t}$ derived from the synthetic output $\hat{\mathbf{X}}_{t}$ of its predecessor $\mathcal{M}_{t-1}$. The literature categorizes such analyses into three paradigms:
- Replace Paradigm. The training set consists exclusively of new synthetic samples generated in the preceding round ($\mathbf{X}_{t}=\hat{\mathbf{X}}_{t}$). Existing analysis ([Proposition 1](https://arxiv.org/html/2606.13732#Thmproposition1)) demonstrates that this paradigm induces catastrophic model collapse, characterized by progressive variance shrinkage and tail information loss (Alemohammad et al., [2024](https://arxiv.org/html/2606.13732#bib.bib14); Shumailov et al., [2023](https://arxiv.org/html/2606.13732#bib.bib19)).
- Accumulate Paradigm. Prior literature suggests augmenting the initial real data $\mathbf{X}_{0}$ with all subsequent generations ($\mathbf{X}_{t}=\mathbf{X}_{t-1}\cup\hat{\mathbf{X}}_{t}$). Existing analysis ([Proposition 2](https://arxiv.org/html/2606.13732#Thmproposition2)) demonstrates that this paradigm prevents variance divergence, thereby ensuring recursive stability (Kazdan et al., [2025](https://arxiv.org/html/2606.13732#bib.bib3); Dey and Donoho, [2024](https://arxiv.org/html/2606.13732#bib.bib21)).
- Accumulate-Subsample Paradigm. To mitigate the prohibitive cost of full accumulation, this method uses a fixed-size subset sampled from the accumulated pool (Shi et al., [2025](https://arxiv.org/html/2606.13732#bib.bib5); Kazdan et al., [2025](https://arxiv.org/html/2606.13732#bib.bib3); Dey and Donoho, [2024](https://arxiv.org/html/2606.13732#bib.bib21)). Empirically, this alternative mitigates model collapse while satisfying computational constraints, particularly when paired with robust data selection mechanisms (Shi et al., [2025](https://arxiv.org/html/2606.13732#bib.bib5)).
##### Multivariate Gaussian Analysis Framework\.
We adopt the theoretical analysis framework established in the current literature (Shumailov et al., [2024](https://arxiv.org/html/2606.13732#bib.bib1); Alemohammad et al., [2024](https://arxiv.org/html/2606.13732#bib.bib14); Bertrand et al., [2024](https://arxiv.org/html/2606.13732#bib.bib11); Kazdan et al., [2025](https://arxiv.org/html/2606.13732#bib.bib3)), distinguishing between two paradigms: Replace and Accumulate.
The Replace Paradigm. In this framework, the functional approximation step generates $n$ synthetic data points from the fitted parameters, $\mathbf{X}_{i,t}\sim\mathcal{N}(\bm{\mu}_{t-1},\bm{\Sigma}_{t-1})$ for $t\in[0,T]$; the corresponding distribution is hereafter referred to as $\mathcal{N}_{t}$. The recursive process is thus defined as:

$$
\begin{aligned}
\text{Sampling:}\quad & \mathbf{X}_{i,t}=\bm{\mu}_{t-1}+\bm{\Sigma}_{t-1}^{1/2}\mathbf{z}_{i,t} \\
\text{Learning:}\quad & \left\{\begin{aligned} \bm{\mu}_{t}&=\frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_{i,t},\\ \bm{\Sigma}_{t}&=\frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{X}_{i,t}-\bm{\mu}_{t})^{\otimes 2},\end{aligned}\right.
\end{aligned}
\tag{1}
$$

where $\mathbf{z}_{i,t}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{d})$ represents the stochastic term, and we define the outer product $(\mathbf{X}_{i,t}-\bm{\mu}_{t})(\mathbf{X}_{i,t}-\bm{\mu}_{t})^{\top}\triangleq(\mathbf{X}_{i,t}-\bm{\mu}_{t})^{\otimes 2}$. Note that under maximum likelihood estimation the result is instead a biased variance estimator (Shumailov et al., [2024](https://arxiv.org/html/2606.13732#bib.bib1)). This recursive dependency precipitates model collapse, which is characterized as follows (Shumailov et al., [2024](https://arxiv.org/html/2606.13732#bib.bib1)):
###### Proposition 1 (Replace Paradigm, [Shumailov et al.](https://arxiv.org/html/2606.13732#bib.bib1)).
Under the Replace paradigm, as iteration $t\to\infty$, the estimator statistics satisfy:

$$\bm{\Sigma}_{t}\xrightarrow{a.s.}\mathbf{0}. \tag{2}$$

Let $\mathbb{W}_{2}$ denote the Wasserstein-2 distance. We have

$$\mathbb{E}[\mathbb{W}_{2}^{2}(\mathcal{N}_{t},\mathcal{N}_{0})]\to\infty. \tag{3}$$

See [Subsection B.2](https://arxiv.org/html/2606.13732#A2.SS2) for the full proof. [Proposition 1](https://arxiv.org/html/2606.13732#Thmproposition1) indicates that the fitted distribution collapses to a point mass (Dirac delta), marking the elimination of diversity. The diverging Wasserstein distance reflects the loss of information relative to the real data manifold.
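The variance collapse in Proposition 1 can be reproduced numerically. The following is a minimal sketch rather than the authors' code: it assumes the true distribution is $\mathcal{N}(\mathbf{0},\mathbf{I}_{d})$, and the helper names `sqrt_psd` and `replace_paradigm` as well as the defaults for `n` and `T` are illustrative choices.

```python
import numpy as np

def sqrt_psd(m):
    """Symmetric square root of a positive semi-definite matrix."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def replace_paradigm(d=2, n=10, T=1000, seed=0):
    """Iterate Eq. (1) and return Tr(Sigma_t) for t = 0..T."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(d), np.eye(d)           # assumed true parameters
    traces = [np.trace(sigma)]
    for _ in range(T):
        z = rng.standard_normal((n, d))
        x = mu + z @ sqrt_psd(sigma)             # sampling step
        mu = x.mean(axis=0)                      # learning step: sample mean
        sigma = np.cov(x, rowvar=False, ddof=1)  # learning step: unbiased covariance
        traces.append(np.trace(sigma))
    return np.array(traces)

traces = replace_paradigm()
print(traces[0], traces[-1])  # the final trace is many orders of magnitude smaller
```

A small `n` is used only to make the collapse visible within a short run; larger sample sizes slow the decay but do not remove it in this toy setting.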
The Accumulate Paradigm aggregates samples from all prior generations. For clarity, we designate the parameters as $\bar{\bm{\mu}}$ and $\bar{\bm{\Sigma}}$. Kazdan et al. ([2025](https://arxiv.org/html/2606.13732#bib.bib3)) characterized the process as:

$$
\begin{aligned}
\text{Sampling:}\quad & \mathbf{X}_{i,t}=\bar{\bm{\mu}}_{t-1}+\bar{\bm{\Sigma}}_{t-1}^{1/2}\mathbf{z}_{i,t}, \\
\text{Learning:}\quad & \left\{\begin{aligned} \bar{\bm{\mu}}_{t}&=\frac{1}{(t+1)n}\sum_{\tau=0}^{t}\sum_{i=1}^{n}\mathbf{X}_{i,\tau},\\ \bar{\bm{\Sigma}}_{t}&=\frac{1}{(t+1)n-1}\sum_{\tau=0}^{t}\sum_{i=1}^{n}(\mathbf{X}_{i,\tau}-\bar{\bm{\mu}}_{t})^{\otimes 2}.\end{aligned}\right.
\end{aligned}
\tag{4}
$$

###### Proposition 2 (Accumulate Paradigm, [Kazdan et al.](https://arxiv.org/html/2606.13732#bib.bib3)).
Under the Accumulate paradigm, as iteration $t\to\infty$, we have the following convergence in expectation:

$$\mathbb{E}[(\bar{\bm{\mu}}_{t}-\bar{\bm{\mu}}_{0})^{2}]\to(1-\alpha_{n})\bar{\bm{\Sigma}}_{0}, \tag{5}$$

$$\mathbb{E}[\bar{\bm{\Sigma}}_{t}]\to\alpha_{n}\bar{\bm{\Sigma}}_{0},\quad\text{where }\alpha_{n}=\frac{\sin(\pi/\sqrt{n})}{\pi/\sqrt{n}}. \tag{6}$$

Consequently, the Wasserstein distance stabilizes at:

$$\mathbb{E}[\mathbb{W}_{2}^{2}(\mathcal{N}_{t},\mathcal{N}_{0})]\to 2(1-\sqrt{\alpha_{n}})\,\text{Tr}(\bar{\bm{\Sigma}}_{0}). \tag{7}$$

See [Subsection B.3](https://arxiv.org/html/2606.13732#A2.SS3) for the full proof. [Proposition 2](https://arxiv.org/html/2606.13732#Thmproposition2) establishes that data accumulation precludes collapse. Now we formally define the Wasserstein distance.
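Before that definition, the stabilization stated in Proposition 2 can be checked numerically. The sketch below is a one-dimensional illustration under stated assumptions, not the authors' implementation: generation 0 is drawn from $\mathcal{N}(0,1)$, and the names `alpha_n` and `accumulate_paradigm` are hypothetical.

```python
import numpy as np

def alpha_n(n):
    """Limiting variance ratio alpha_n from Eq. (6)."""
    r = np.pi / np.sqrt(n)
    return np.sin(r) / r

def accumulate_paradigm(n=50, T=300, seed=0):
    """1-D version of Eq. (4): refit on the pooled data of all generations."""
    rng = np.random.default_rng(seed)
    pool = rng.normal(0.0, 1.0, size=n)           # generation 0: assumed real data
    for _ in range(T):
        mu, var = pool.mean(), pool.var(ddof=1)   # fit on everything seen so far
        new = rng.normal(mu, np.sqrt(var), size=n)
        pool = np.concatenate([pool, new])        # accumulate instead of replace
    return pool.var(ddof=1)

print(alpha_n(100))           # close to 1 for moderately large n
print(accumulate_paradigm())  # a single run stays bounded away from zero
```

Equation (6) is a statement about the expectation, so a single run only illustrates that the pooled variance does not vanish; averaging over many seeds would be needed to compare against `alpha_n(n)` directly.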
##### Wasserstein Distance\.
Let $\mathcal{P}_{p}(\mathbb{R}^{d})$ denote the space of probability measures with finite $p$-th moments on the metric space $\mathbb{R}^{d}$. For any $\mathcal{P},\mathcal{Q}\in\mathcal{P}_{p}(\mathbb{R}^{d})$, the $p$-Wasserstein distance $\mathbb{W}_{p}$ reads:

$$\mathbb{W}_{p}(\mathcal{P},\mathcal{Q})=\Big(\inf_{\pi\in\Pi(\mathcal{P},\mathcal{Q})}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\pi}[d^{p}(\mathbf{x},\mathbf{y})]\Big)^{1/p}, \tag{8}$$

where $\Pi(\mathcal{P},\mathcal{Q})$ denotes the set of couplings with marginals $\mathcal{P}$ and $\mathcal{Q}$. The ground metric $d(\mathbf{x},\mathbf{y})$ defines the distance between samples $\mathbf{x},\mathbf{y}\in\mathbb{R}^{d}$. We defer specific definitions of $d(\mathbf{x},\mathbf{y})$ for vision and language to [Subsection C.1](https://arxiv.org/html/2606.13732#A3.SS1).
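For the Gaussian setting used in the analysis, with $p=2$ and a Euclidean ground metric, the distance has the well-known closed form $\mathbb{W}_{2}^{2}=\|\bm{\mu}_{1}-\bm{\mu}_{2}\|^{2}+\text{Tr}\big(\bm{\Sigma}_{1}+\bm{\Sigma}_{2}-2(\bm{\Sigma}_{2}^{1/2}\bm{\Sigma}_{1}\bm{\Sigma}_{2}^{1/2})^{1/2}\big)$. The sketch below evaluates it; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def sqrt_psd(m):
    """Symmetric square root of a positive semi-definite matrix."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def w2_squared_gaussian(mu1, cov1, mu2, cov2):
    """Squared Wasserstein-2 distance between N(mu1, cov1) and N(mu2, cov2)."""
    s2 = sqrt_psd(cov2)
    cross = sqrt_psd(s2 @ cov1 @ s2)
    mean_term = np.sum((np.asarray(mu1) - np.asarray(mu2)) ** 2)
    cov_term = np.trace(cov1 + cov2 - 2.0 * cross)
    return float(mean_term + cov_term)

# A point mass at u = (3, 4) versus N(0, I_2): squared shift plus Tr(I_2).
u = np.array([3.0, 4.0])
print(w2_squared_gaussian(u, np.zeros((2, 2)), np.zeros(2), np.eye(2)))  # 27.0
```

The example uses a degenerate (zero) covariance to mimic a collapsed model, which is why the result reduces to the squared mean shift plus the trace of the reference covariance.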
## 3 Theoretical Intuitions
Although [Proposition 2](https://arxiv.org/html/2606.13732#Thmproposition2) guarantees stability under unbiased accumulation, we show that data selection is double-edged: when the reference is local and fragmented, biased selection precipitates variance collapse ([Theorem 1](https://arxiv.org/html/2606.13732#Thmtheorem1)), induces an explicit asymptotic decay rate ([Theorem 2](https://arxiv.org/html/2606.13732#Thmtheorem2)), and incurs a downstream Wasserstein discrepancy cost ([Theorem 3](https://arxiv.org/html/2606.13732#Thmtheorem3)).
### 3.1 Selection Bias Precipitates Model Collapse
###### Assumption 1\.
For analytical tractability, we characterize the selection mechanism via a score function $U(\mathbf{x}):\mathbb{R}^{d}\to\mathbb{R}$ that is locally concave around a target state $\mathbf{x}=\mathbf{u}^{*}$, with the initialization lying within the local basin of attraction.
[Assumption 1](https://arxiv.org/html/2606.13732#Thmassumption1) serves as a tractable formalism for analyzing biased selection mechanisms. It covers strategies ranging from metric-based data pruning (Shi et al., [2025](https://arxiv.org/html/2606.13732#bib.bib5)) in data-scarce and low-resource environments, e.g., selecting images closest to the centroid (He et al., [2023](https://arxiv.org/html/2606.13732#bib.bib61)) or covariance (Rezaei et al., [2026](https://arxiv.org/html/2606.13732#bib.bib6)) of real features (Feng et al., [2025](https://arxiv.org/html/2606.13732#bib.bib4)), to active preference optimization strategies, e.g., Best-of-N sampling (Gui et al., [2024](https://arxiv.org/html/2606.13732#bib.bib33)). The biases in these two scenarios have distinct origins: environmental constraints restrict selection to a local target $\mathbf{u}^{*}$, whereas intentional preference curation restricts it to a preferred target $\mathbf{u}^{*}$. The abstraction nonetheless captures their shared core goal: prioritizing samples that lie close to a preferred ideal.
At generation $t$, we define the bounded selection region $\mathcal{R}_{t}\subset\mathbb{R}^{d}$ as a high-utility neighborhood enclosing the target $\mathbf{u}^{*}$, dynamically calibrated to select the top-$\alpha$ probability mass of the current sampling distribution, where $\alpha\in(0,1)$ is the selection ratio and acts as a filtering budget (e.g., selecting the top-$n$ candidates from $N$ generated samples implies $\alpha=n/N$). The selected data therefore follow a truncated multivariate normal distribution, denoted as $\mathbf{X}_{i,t}\sim\mathcal{TN}(\bar{\bm{\mu}}_{t-1},\bar{\bm{\Sigma}}_{t-1},\mathcal{R}_{t})$. The Accumulate paradigm with $n$ selected samples is then formalized as:

$$
\begin{aligned}
\text{Sampling:}\quad & \tilde{\mathbf{X}}_{i,t}\sim\mathcal{N}(\bar{\bm{\mu}}_{t-1},\bar{\bm{\Sigma}}_{t-1}), \\
\text{Selecting:}\quad & \mathbf{X}_{i,t}\sim\mathcal{TN}(\bar{\bm{\mu}}_{t-1},\bar{\bm{\Sigma}}_{t-1},\mathcal{R}_{t}), \\
& \text{s.t.}\ \int_{\mathcal{R}_{t}}p(\mathbf{x}\mid\bar{\bm{\mu}}_{t-1},\bar{\bm{\Sigma}}_{t-1})\,d\mathbf{x}=\alpha, \\
\text{Learning:}\quad & \left\{\begin{aligned} \bar{\bm{\mu}}_{t}&=\frac{1}{(t+1)n}\sum_{\tau=0}^{t}\sum_{i=1}^{n}\mathbf{X}_{i,\tau},\\ \bar{\bm{\Sigma}}_{t}&=\frac{1}{(t+1)n-1}\sum_{\tau=0}^{t}\sum_{i=1}^{n}(\mathbf{X}_{i,\tau}-\bar{\bm{\mu}}_{t})^{\otimes 2}.\end{aligned}\right.
\end{aligned}
\tag{9}
$$
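The sampling, selecting, and learning steps of Eq. (9) can be sketched as a short simulation. This is an illustration under stated assumptions, not the authors' experimental code: the score is taken to be $U(\mathbf{x})=-\|\mathbf{x}-\mathbf{u}^{*}\|^{2}$, the region $\mathcal{R}_{t}$ is approximated empirically by keeping the $n$ candidates closest to $\mathbf{u}^{*}$ out of $N=n/\alpha$, the real data are $\mathcal{N}(\mathbf{0},\mathbf{I}_{d})$, and all function names are hypothetical.

```python
import numpy as np

def sqrt_psd(m):
    """Symmetric square root of a positive semi-definite matrix."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def accumulate_with_selection(u_star, n=50, alpha=0.25, T=200, seed=0):
    """Accumulate paradigm with top-alpha selection toward u_star (Eq. 9)."""
    rng = np.random.default_rng(seed)
    d = len(u_star)
    n_candidates = int(round(n / alpha))
    pool = rng.standard_normal((n, d))                 # generation 0: assumed real data
    traces = []
    for _ in range(T):
        mu = pool.mean(axis=0)
        cov = np.cov(pool, rowvar=False, ddof=1)
        cand = mu + rng.standard_normal((n_candidates, d)) @ sqrt_psd(cov)  # sampling
        dist = np.linalg.norm(cand - u_star, axis=1)
        keep = cand[np.argsort(dist)[:n]]              # selecting: n closest to u_star
        pool = np.vstack([pool, keep])                 # learning uses the pooled data
        traces.append(np.trace(np.cov(pool, rowvar=False, ddof=1)))
    return pool.mean(axis=0), np.array(traces)

u_star = np.array([1.0, 0.5])
mean, traces = accumulate_with_selection(u_star)
print(mean, traces[0], traces[-1])  # mean moves toward u_star while the trace shrinks
```

Setting `alpha=1.0` in this sketch removes the selection step and recovers the plain Accumulate loop, which makes it easy to compare the two regimes under the same seed.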
We demonstrate that biased selection breaks the stability of the Accumulate paradigm, leading to a new form of collapse\.
###### Theorem 1 (Selection Bias Precipitates Collapse).
Consider the Accumulate paradigm with top-$\alpha$ selection toward an ideal $\mathbf{u}^{*}$. As iteration $t\to\infty$, the estimator statistics behave as follows:

$$\|\bar{\bm{\mu}}_{t}-\mathbf{u}^{*}\|^{2}\xrightarrow{a.s.}0, \tag{10}$$

$$\bar{\bm{\Sigma}}_{t}\xrightarrow{a.s.}\mathbf{0}. \tag{11}$$

The Wasserstein-2 distance converges to:

$$\mathbb{E}[\mathbb{W}_{2}^{2}(\mathcal{N}_{t},\mathcal{N}_{0})]\to\|\mathbf{u}^{*}-\bar{\bm{\mu}}_{0}\|^{2}+\text{Tr}(\bar{\bm{\Sigma}}_{0}). \tag{12}$$

See [Subsection B.4](https://arxiv.org/html/2606.13732#A2.SS4) for the full proof. [Theorem 1](https://arxiv.org/html/2606.13732#Thmtheorem1) shows that while the mean aligns with the target $\mathbf{u}^{*}$ and the Wasserstein distance eventually stabilizes, the variance inexorably collapses. This result formalizes how local-reference selection can turn fidelity to a silo into diversity loss: although the mean aligns with $\mathbf{u}^{*}$, the variance needed to represent modes outside the local reference is progressively erased. Furthermore, [Theorem 1](https://arxiv.org/html/2606.13732#Thmtheorem1) resonates with concurrent analyses (Ferbach et al., [2024](https://arxiv.org/html/2606.13732#bib.bib9); Wei and Zhang, [2025](https://arxiv.org/html/2606.13732#bib.bib32)), which demonstrate that optimizing for human preferences inadvertently leads to bias amplification. We go beyond these qualitative observations and quantify the rate at which the variance dissipates.
### 3.2 The Asymptotic Rate of Model Collapse
To explicitly derive the collapse rate of the variance, we standardize the selection process to the isotropic frame. Let $\mathcal{D}_{t-1}\subset\mathbb{R}^{d}$ denote the standardized selection region:

$$\mathcal{D}_{t-1}=\{\mathbf{z}\in\mathbb{R}^{d}\mid\bar{\bm{\mu}}_{t-1}+\bar{\bm{\Sigma}}_{t-1}^{1/2}\mathbf{z}\in\mathcal{R}_{t}\}. \tag{13}$$

By strict enforcement of the selection ratio, the probability measure of this region satisfies $\int_{\mathcal{D}_{t-1}}\phi_{d}(\mathbf{z})\,d\mathbf{z}=\alpha$. Consequently, the retained samples $\mathbf{X}_{i,t}$ can be reparameterized via the truncated variable $\bm{\eta}_{i,t}\sim\mathcal{TN}(\mathbf{0},\mathbf{I}_{d},\mathcal{D}_{t-1})$ as:

$$\mathbf{X}_{i,t}=\bar{\bm{\mu}}_{t-1}+\bar{\bm{\Sigma}}_{t-1}^{1/2}\bm{\eta}_{i,t}. \tag{14}$$

We characterize the statistics of $\bm{\eta}_{i,t}$ via its first two moments conditioned on the filtration $\{\mathcal{F}_{t}\}_{t\geq 0}$, where $\mathcal{F}_{t}$ represents the information available up to time $t$.
- Mean Drift ($\bm{a}_{t-1}$): Defined as $\bm{a}_{t-1}\triangleq\mathbb{E}[\bm{\eta}_{i,t}\mid\mathcal{F}_{t-1}]$. Geometrically, the vector $\bm{a}_{t-1}$ denotes the directional force driven by the asymmetry of the selection region, propelling the empirical mean towards the target $\mathbf{u}^{*}$.
- Covariance Contraction ($\mathbf{B}_{t-1}$): Defined as $\mathbf{B}_{t-1}\triangleq\text{Cov}(\bm{\eta}_{i,t}\mid\mathcal{F}_{t-1})$. The matrix $\mathbf{B}_{t-1}$ captures the reduction in uncertainty due to biased data selection.
Figure 1: Multivariate Gaussian Modeling. We observe that sample selection bias precipitates model collapse in both Replace and Accumulate paradigms. In this setting, diversity under the Accumulate paradigm with selection dissipates at a power-law rate.

Accordingly, we define the dissipation matrix $\mathbf{\Psi}_{t-1}$ as:

$$\mathbf{\Psi}_{t-1}\triangleq\mathbf{I}_{d}-(\mathbf{B}_{t-1}+\bm{a}_{t-1}\bm{a}_{t-1}^{\top}). \tag{15}$$

Under the local-basin analysis above, the first-order dissipation matrix admits a uniform positive spectral gap. In particular, there exists a constant $\psi>0$ such that $\lambda_{\min}(\mathbf{\Psi}_{t-1})\geq\psi$ along the trajectory.
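The quantities $\bm{a}_{t-1}$, $\mathbf{B}_{t-1}$, and $\mathbf{\Psi}_{t-1}$ can be estimated by Monte Carlo for a concrete region. The sketch below assumes, purely for illustration, that the standardized region $\mathcal{D}_{t-1}$ is a Euclidean ball around a given `center` holding Gaussian mass $\alpha$; the function name and defaults are hypothetical.

```python
import numpy as np

def dissipation_matrix(center, alpha=0.25, m=200_000, seed=0):
    """Monte Carlo estimate of a, B and Psi = I - (B + a a^T) from Eq. (15)."""
    rng = np.random.default_rng(seed)
    d = len(center)
    z = rng.standard_normal((m, d))                  # standardized candidates
    dist = np.linalg.norm(z - center, axis=1)
    eta = z[dist <= np.quantile(dist, alpha)]        # samples of the truncated variable
    a = eta.mean(axis=0)                             # mean drift
    B = np.cov(eta, rowvar=False, ddof=1)            # covariance contraction
    psi = np.eye(d) - (B + np.outer(a, a))           # dissipation matrix
    return a, B, psi

a, B, psi = dissipation_matrix(np.zeros(2))
print(np.linalg.eigvalsh(psi).min())  # roughly 0.86 for a centered ball with alpha = 0.25
```

For a centered ball in two dimensions the drift vanishes by symmetry, and the smallest eigenvalue of the estimate plays the role of the spectral gap $\psi$ in this toy configuration.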
###### Theorem 2 (Asymptotic Rate of Model Collapse).
Assume that the trajectory remains in the local basin for all iterations. As iteration $t\to\infty$,

$$\text{Tr}(\bar{\bm{\Sigma}}_{t})=\mathcal{O}_{a.s.}\big(t^{-\psi}\big). \tag{16}$$
See [Subsection B.5](https://arxiv.org/html/2606.13732#A2.SS5) for the proof. [Theorem 2](https://arxiv.org/html/2606.13732#Thmtheorem2) establishes a power-law rate. Intuitively, this corresponds to a two-phase collapse dynamic: ① an initial phase of rapid homogenization driven by strict sample filtering, ② followed by slow, asymptotic convergence to a Dirac point mass. This quantification reveals the tension: stricter alignment with the local target $\mathbf{u}^{*}$ can improve local fidelity while increasing the dissipation magnitude $\lambda_{\min}(\mathbf{\Psi}_{\infty})$. In data-scarce or low-resource environments, this means that a verifier with limited reference coverage may accelerate tail erosion by selecting samples that appear locally high-quality.
Empirically, we validate these findings in [Figure 1](https://arxiv.org/html/2606.13732#S3.F1) using the Multivariate Gaussian Modeling framework of Kazdan et al. ([2025](https://arxiv.org/html/2606.13732#bib.bib3)) and Shumailov et al. ([2024](https://arxiv.org/html/2606.13732#bib.bib1)). Monitoring the ratio $\text{Tr}(\bar{\bm{\Sigma}}_{t})/\text{Tr}(\bar{\bm{\Sigma}}_{0})$, we compare the Replace paradigm ($n=200$ samples per iteration) against the Accumulate paradigm. When incorporating a selection mechanism that keeps the top-$n$ samples closest to a target state $\mathbf{u}^{*}$, we observe diversity collapse in both paradigms. Specifically, the Replace paradigm suffers from rapid variance depletion, whereas the Accumulate paradigm exhibits the power-law decay predicted in [Theorems 1](https://arxiv.org/html/2606.13732#Thmtheorem1) and [2](https://arxiv.org/html/2606.13732#Thmtheorem2): it mirrors the rapid initial collapse of the Replace paradigm but subsequently transitions into a protracted asymptotic decay. Due to space limitations, we defer additional results on non-Gaussian distributions and extended configurations to [Subsection C.4](https://arxiv.org/html/2606.13732#A3.SS4).
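One simple way to read a power-law exponent off a monitored trace curve is a straight-line fit in log-log coordinates. The sketch below is an assumed diagnostic, not a procedure described in the paper; `fit_power_law_exponent` and `burn_in` are hypothetical names, and the synthetic input is constructed to follow $t^{-0.8}$ exactly.

```python
import numpy as np

def fit_power_law_exponent(traces, burn_in=10):
    """Estimate psi_hat such that traces[t] ~ C * t**(-psi_hat) for large t."""
    traces = np.asarray(traces, dtype=float)
    t = np.arange(1, len(traces) + 1, dtype=float)
    # Skip the initial transient, then fit log(trace) = log(C) - psi_hat * log(t).
    slope, _ = np.polyfit(np.log(t[burn_in:]), np.log(traces[burn_in:]), deg=1)
    return -slope

t = np.arange(1, 501, dtype=float)
synthetic = 3.0 * t ** -0.8          # synthetic curve with a known exponent
print(fit_power_law_exponent(synthetic))
```

Since Theorem 2 states an upper bound, an exponent fitted on simulated traces is only an empirical decay rate and may differ from $\psi$.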
### 3.3 The Wasserstein Cost of Model Collapse
While our variance-decay analysis leverages Gaussian approximations to obtain explicit dynamics, the impact of collapse on the generalization performance of the downstream prediction task does not hinge on such parametric assumptions. To quantify this effect under minimal distributional structure, we adopt standard regularity conditions commonly used in learning-theoretic optimal transport generalization (Courty et al., [2017](https://arxiv.org/html/2606.13732#bib.bib36); Redko et al., [2017](https://arxiv.org/html/2606.13732#bib.bib37)).
###### Assumption 2\.
Let $\mathcal{H}$ be a hypothesis class of predictors $h:\mathcal{X}\to\mathcal{Y}$. We assume the following regularity conditions.
- Hypothesis Lipschitzness. Every $h\in\mathcal{H}$ is $\epsilon$-Lipschitz w.r.t. $d_{\mathcal{X}}$, i.e., $\|h(x)-h(x^{\prime})\|\leq\epsilon\,d_{\mathcal{X}}(x,x^{\prime})$ for all $x,x^{\prime}\in\mathcal{X}$.
- Loss Lipschitzness. The loss $\mathcal{L}(\hat{y},y)$ is $\ell$-Lipschitz in both inputs w.r.t. the bounded task metric $d_{\mathcal{Y}}$, i.e., $|\mathcal{L}(\hat{y}_{1},y_{1})-\mathcal{L}(\hat{y}_{2},y_{2})|\leq\ell\,(d_{\mathcal{Y}}(\hat{y}_{1},\hat{y}_{2})+d_{\mathcal{Y}}(y_{1},y_{2}))$ for all $\hat{y}_{1},\hat{y}_{2},y_{1},y_{2}\in\mathcal{Y}$.
- $(\epsilon,\delta)$-Probabilistic Cross-Lipschitzness. Let $g^{*}:\mathcal{X}\to\mathcal{Y}$ denote the global oracle mapping (ground truth). The local verification process imposes a biased effective supervision signal $g_{t}:\mathcal{X}\to\mathcal{Y}$ on the selected samples. We assume the local view $g_{t}$ and the global view $g^{*}$ satisfy $(\epsilon,\delta)$-probabilistic cross-Lipschitzness (Just et al., [2023](https://arxiv.org/html/2606.13732#bib.bib35)).

Let $\mathcal{R}_{\mathcal{D}}(h;g)\triangleq\mathbb{E}_{x\sim\mathcal{D}}[\mathcal{L}(h(x),g(x))]$ denote the expected risk of predictor $h$ under distribution $\mathcal{D}$. Following Just et al. ([2023](https://arxiv.org/html/2606.13732#bib.bib35)), we have the following generalization cost:
###### Theorem 3 (Wasserstein Cost of Model Collapse).
Let [Assumption 2](https://arxiv.org/html/2606.13732#Thmassumption2) hold. Let $\mathcal{D}_{t}$ be the distribution of synthetic data filtered by local verification at generation $t$, and let $\mathcal{D}^{*}$ be the true data manifold. For any $h_{t}\in\mathcal{H}$ trained on $\mathcal{D}_{t}$, the expected risk is bounded by

$$\mathcal{R}_{\mathcal{D}^{*}}(h_{t};g^{*})\leq 2\ell\epsilon\,\mathbb{W}_{p}(\mathcal{D}_{t},\mathcal{D}^{*})+\mathcal{R}_{\mathcal{D}_{t}}(h_{t};g_{t})+\mathcal{O}(\ell\,\delta). \tag{17}$$
Please refer to [Subsection B.6](https://arxiv.org/html/2606.13732#A2.SS6) for the complete proof. [Theorem 3](https://arxiv.org/html/2606.13732#Thmtheorem3) decomposes the expected risk on the true manifold $\mathcal{D}^{*}$ into three components: (i) the distributional discrepancy measured by the Wasserstein distance $\mathbb{W}_p(\mathcal{D}_t, \mathcal{D}^{*})$; (ii) the expected risk on the current filtered distribution $\mathcal{D}_t$; and (iii) an error term due to probabilistic violations. When $h_t$ is well-optimized on the local synthetic data $\mathcal{D}_t$, the second term becomes negligible. Consequently, generalization performance is dominated by the first term. [Theorem 3](https://arxiv.org/html/2606.13732#Thmtheorem3) suggests that access to ground truth could theoretically preclude mode collapse. However, low-resource and sparsity constraints render such access infeasible in siloed environments. Unlike preference-based curation, where bias stems from explicit human objectives and is mitigable via established techniques (Grover et al., [2019](https://arxiv.org/html/2606.13732#bib.bib41); Chen et al., [2024](https://arxiv.org/html/2606.13732#bib.bib42)), the verification bias in data silos is an intrinsic consequence of fragmented access to the global distribution.
## 4 Methodology
### 4.1 Wasserstein-Gradient-Based Selection
Inspired by Kessler et al. ([2025](https://arxiv.org/html/2606.13732#bib.bib49)); Li et al. ([2024](https://arxiv.org/html/2606.13732#bib.bib31)), we delineate the theoretical foundation of the selection mechanism. The core objective is to quantify the contribution of individual synthetic samples to the global distributional alignment. Specifically, given a synthetic dataset $\mathcal{P} = \{x_i\}_{i=1}^{n}$ and a reference real dataset $\mathcal{Q}$, we examine the sensitivity of the Wasserstein distance $\mathbb{W}_p(\mathcal{P}, \mathcal{Q})$ with respect to perturbations of $x_i$. Conceptually, this is similar to influence functions (IF) (Koh and Liang, [2017](https://arxiv.org/html/2606.13732#bib.bib50)), which were adopted for adversarial curation strategies in self-consuming generative models (Wei and Zhang, [2025](https://arxiv.org/html/2606.13732#bib.bib32)). A key distinction is that, while IF-based methods rely on linear approximations for infinitesimal perturbations, the discrete Wasserstein distance is formulated as a linear programming problem (LP). According to the Sensitivity Theorem (Bertsekas, [1997](https://arxiv.org/html/2606.13732#bib.bib43)), the gradients derived from the LP dual solution remain valid for perturbations within a local polytope. This allows us to reliably predict the variation in Wasserstein distance induced by re-weighting a sample without the need for re-calculation.
Similar to Li et al. ([2024](https://arxiv.org/html/2606.13732#bib.bib31)), we utilize the first-order variation induced by the Kantorovich dual potentials. Recall the dual formulation: $\mathbb{W}_p(\mathcal{P}, \mathcal{Q}) = \sup_{(f, g) \in \Phi_c}\left(\langle f, \mathcal{P}\rangle + \langle g, \mathcal{Q}\rangle\right)$, where $\Phi_c$ denotes the set of admissible potentials satisfying $f(x) + g(y) \leq d^{p}(x, y)$. The optimal dual potential $f^{*}$ serves as the subgradient of the transport cost with respect to the probability mass $\mathcal{P}$, i.e., $\nabla_{\mathcal{P}}\mathbb{W}_p(\mathcal{P}, \mathcal{Q}) = (f^{*})^{\top}$.
To eliminate the degree of freedom arising from the translational invariance of the simplex constraint ($\sum_{i=1}^{N}\mathcal{P}(x_i) = 1$), the zero-sum convention is adopted to fix the subgradients, following the strategy in Just et al. ([2023](https://arxiv.org/html/2606.13732#bib.bib35)). Consequently, for a sample $x_i \in \mathcal{P}$, the calibrated gradients are derived by:
$$\mathcal{S}(x_i) \triangleq \frac{\partial\,\mathbb{W}_p(\mathcal{P}, \mathcal{Q})}{\partial\,\mathcal{P}(x_i)} = f^{*}(x_i) - \frac{1}{N-1}\sum_{j \neq i}^{N} f^{*}(x_j). \tag{18}$$
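As a minimal sketch of Eq. (18), the calibration can be written directly on a vector of dual potentials. The function name `calibrated_scores` and the example potentials are hypothetical; obtaining $f^{*}$ from an OT solver is assumed and not shown. Under this reading, a large positive score marks a sample whose up-weighting would increase the distance to the reference.

```python
def calibrated_scores(potentials):
    """Eq. (18): f*(x_i) minus the mean of the other N-1 potentials."""
    n = len(potentials)
    if n < 2:
        raise ValueError("need at least two samples")
    total = sum(potentials)
    return [f_i - (total - f_i) / (n - 1) for f_i in potentials]


# Hypothetical dual potentials for three synthetic samples.
f_star = [1.0, 2.0, 3.0]
scores = calibrated_scores(f_star)

# Kantorovich potentials are only defined up to an additive constant;
# the calibration removes that freedom, so a shifted copy scores the same.
shifted_scores = calibrated_scores([f + 10.0 for f in f_star])
```

The shift check illustrates why a convention is needed at all: without it, two equally valid dual solutions would rank samples identically but report different raw values.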
Intuitively, the synthetic dataset $\mathcal{P}$ could be transmitted to decentralized parties holding real data shards $\mathcal{Q}_k$ ($k = 1, \dots, K$) to compute the sensitivity scores $\mathcal{S}_k(x_i)$. However, recent studies (Ganev and De Cristofaro, [2025](https://arxiv.org/html/2606.13732#bib.bib52); Chen et al., [2020](https://arxiv.org/html/2606.13732#bib.bib51); van Breugel et al., [2023](https://arxiv.org/html/2606.13732#bib.bib53)) highlight that synthetic data remains susceptible to privacy leakage. This critical limitation motivates privacy-preserving methods based on the following Wasserstein geometry property.
### 4.2 Wasserstein Geometry Property
We first introduce intrinsic geometric properties in $\mathcal{P}_p(\mathbb{R}^{d})$.
###### Property 1 (Wasserstein Barycenter, Agueh and Carlier ([2011](https://arxiv.org/html/2606.13732#bib.bib25))).
Let $\{\mathcal{Q}_k\}_{k=1}^{K} \subset \mathcal{P}_p(\mathbb{R}^{d})$ denote a collection of square-integrable probability measures with weights $\{\lambda_k\}_{k=1}^{K}$ such that $\sum_{k=1}^{K}\lambda_k = 1$, where typically $\lambda_k = 1/K$. The Wasserstein barycenter $\mathcal{Q}^{*}$ is characterized as:
$$\mathcal{Q}^{*} = \mathop{\arg\min}_{\mathcal{Q} \in \mathcal{P}_p(\mathbb{R}^{d})} \sum_{k=1}^{K}\lambda_k\,\mathbb{W}_p^{p}(\mathcal{Q}, \mathcal{Q}_k). \tag{19}$$
###### Property 2 (McCann’s Interpolation, McCann ([1997](https://arxiv.org/html/2606.13732#bib.bib46)); Rakotomamonjy et al. ([2024](https://arxiv.org/html/2606.13732#bib.bib29))).
For any two continuous measures $\mathcal{P}_0, \mathcal{P}_1 \in \mathcal{P}_p(\mathbb{R}^{d})$, the displacement interpolation $\mathcal{P}_t$ for $t \in [0, 1]$ defines the optimal trajectory between them. It is constructed via a transport map $\mathbf{T}$ pushing $\mathcal{P}_0$ to $\mathcal{P}_1$:
$$\mathcal{P}_t = \big((1-t)\,\mathbf{I}_d + t\,\mathbf{T}\big)_{\#}\mathcal{P}_0. \tag{20}$$
In the discrete setting, we approximate the intractable map $\mathbf{T}$ using barycentric mapping (Courty et al., [2018](https://arxiv.org/html/2606.13732#bib.bib55)). The interpolation is realized by the trajectory $x_i(t) = (1-t)\,x_i + t \cdot n(\mathbf{P}^{\star}\mathbf{X}')_i$, where the barycentric projection $n(\mathbf{P}^{\star}\mathbf{X}')_i$ acts as the empirical proxy for $\mathbf{T}(x_i)$ (Peyré et al., [2019](https://arxiv.org/html/2606.13732#bib.bib28)).
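The discrete trajectory $x_i(t)$ can be illustrated with a small sketch. It assumes the optimal plan $\mathbf{P}^{\star}$ is already available (here a hand-written $2 \times 2$ plan that is optimal for squared Euclidean cost on this toy data) and that source points carry uniform mass $1/n$; the function name `barycentric_interpolate` is hypothetical.

```python
def barycentric_interpolate(X, X_prime, plan, t):
    """Return the points x_i(t) = (1 - t) * x_i + t * n * (P X')_i.

    X, X_prime: lists of d-dimensional points (lists of floats).
    plan: n x m coupling matrix whose rows each sum to 1/n.
    """
    n = len(X)
    dim = len(X[0])
    out = []
    for i, x in enumerate(X):
        # Barycentric projection: mass-weighted average of the targets of x_i,
        # rescaled by n because row i of the plan carries total mass 1/n.
        proj = [n * sum(plan[i][j] * X_prime[j][d] for j in range(len(X_prime)))
                for d in range(dim)]
        out.append([(1 - t) * x[d] + t * proj[d] for d in range(dim)])
    return out


X = [[0.0, 0.0], [1.0, 0.0]]
X_prime = [[2.0, 0.0], [3.0, 2.0]]
plan = [[0.5, 0.0], [0.0, 0.5]]  # assumed optimal plan: x_0 -> x'_0, x_1 -> x'_1
midpoint = barycentric_interpolate(X, X_prime, plan, 0.5)
```

When a row of the plan splits its mass over several targets, the projection is their weighted mean, which is why it serves only as an empirical proxy for the transport map.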
###### Property 3 (Wasserstein Geodesics, Ambrosio et al. ([2005](https://arxiv.org/html/2606.13732#bib.bib44)); Kolouri et al. ([2017](https://arxiv.org/html/2606.13732#bib.bib45))).
Let $\mathbf{T}$ be the optimal transport map pushing $\mathcal{P}$ to $\mathcal{Q}$. For any $t \in (0, 1)$, the intermediate interpolant $\xi^{*}$ is constructed as $\xi^{*} = \big((1-t)\,\mathbf{I}_d + t\,\mathbf{T}\big)_{\#}\mathcal{P}$. This interpolant satisfies:
$$\mathbb{W}_p(\mathcal{P}, \mathcal{Q}) = \mathbb{W}_p(\mathcal{P}, \xi^{*}) + \mathbb{W}_p(\xi^{*}, \mathcal{Q}). \tag{21}$$
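Eq. (21) can be checked numerically in a setting where the optimal map is known in closed form. The sketch below assumes one-dimensional, equal-size, uniformly weighted samples, for which the sorted (monotone) matching is optimal for any $p \geq 1$; the helper names are hypothetical.

```python
def wasserstein_1d(xs, ys, p=2):
    """W_p between two equal-size, uniformly weighted 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))  # monotone matching is optimal in 1-D
    return (sum(abs(a - b) ** p for a, b in pairs) / len(xs)) ** (1.0 / p)


def interpolate_1d(xs, ys, t):
    """Displacement interpolant between the two samples at time t."""
    return [(1 - t) * a + t * b for a, b in zip(sorted(xs), sorted(ys))]


P = [0.0, 1.0, 4.0]
Q = [2.0, 5.0, 9.0]
xi = interpolate_1d(P, Q, 0.3)  # a point on the geodesic

direct = wasserstein_1d(P, Q)
via_geodesic = wasserstein_1d(P, xi) + wasserstein_1d(xi, Q)

# A measure off the geodesic only satisfies the triangle inequality.
off = [10.0, 10.0, 10.0]
via_detour = wasserstein_1d(P, off) + wasserstein_1d(off, Q)
```

Equality holds for the interpolant, while the detour through an arbitrary measure is strictly longer here, which is the property the proxy construction relies on.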
##### Main Intuition.
Li et al. ([2024](https://arxiv.org/html/2606.13732#bib.bib31)) show that Wasserstein distances and barycenters can be computed without sharing raw data by constructing a proxy interpolation $\xi^{*}$ (via [Property 2](https://arxiv.org/html/2606.13732#Thmproperty2)) on the geodesic between two distributions $\mathcal{P}$ and $\mathcal{Q}$; this interpolant satisfies the geodesic property in [Property 3](https://arxiv.org/html/2606.13732#Thmproperty3).
### 4.3 Collaborative Geodesic Interpolation Calculation
We now present Scheme I. Motivated by [Theorem 1](https://arxiv.org/html/2606.13732#Thmtheorem1), we broaden the selection criteria to encompass multiple target preferences rather than a single preference $u^{*}$, which expands the selection space $\mathcal{R}_t$ and thereby mitigates collapse.
We utilize [Property 3](https://arxiv.org/html/2606.13732#Thmproperty3) to avoid direct data communication. When $\xi_k^{*}$ is an interpolation measure (computed by [Property 2](https://arxiv.org/html/2606.13732#Thmproperty2)) on the geodesic between $\mathcal{P}$ and $\mathcal{Q}_k$, the dual formulation is
$$\mathbb{W}_p(\mathcal{P}, \mathcal{Q}_k) = \sup_{(f_k, g_k) \in \Phi_c}\left(\langle f_k, \mathcal{P}\rangle + \langle g_k, \xi_k^{*}\rangle\right) + \sup_{(h_k, j_k) \in \Phi_c}\left(\langle h_k, \xi_k^{*}\rangle + \langle j_k, \mathcal{Q}_k\rangle\right).$$
We thus have $\nabla_{\mathcal{P}}\mathbb{W}_p(\mathcal{P}, \mathcal{Q}_k) \approx \nabla_{\mathcal{P}}\mathbb{W}_p(\mathcal{P}, \xi_k^{*}) = (f^{*})^{\top}$, which implies we can compute $\mathcal{S}_k(x_i)$ using the proxy interpolations $\xi_k^{*}$ without directly accessing real data $\mathcal{Q}_k$:
$$\mathcal{S}_k(x_i) = \frac{\partial\,\mathbb{W}_p(\mathcal{P}, \xi_k^{*})}{\partial\,\mathcal{P}(x_i)} = f^{*}(x_i) - \frac{1}{N-1}\sum_{j \neq i}^{N} f^{*}(x_j). \tag{22}$$
Therefore, the key problem is finding interpolations on the Wasserstein geodesics between $\mathcal{P}$ and $\mathcal{Q}_k$ without sharing $\mathcal{P}$ and $\mathcal{Q}_k$. As illustrated in [Figure 2](https://arxiv.org/html/2606.13732#S4.F2), we abstract the overall process as a geometric procedure.
Initialization. An arbitrary proxy measure $\xi_k^{(0)}$ is initialized, typically as a Gaussian distribution or a selection from a public dataset proximate to the target distribution. For round $r = 1, \ldots, R$:
1. Interpolation. The interpolations $\xi_{\mathcal{P}}^{(r)}$ and $\xi_{\mathcal{Q}_k}^{(r)}$ are computed from the proxy $\xi_k^{(r)}$ to $\mathcal{P}$ and $\mathcal{Q}_k$, respectively, i.e., $\xi_{\mathcal{P}}^{(r)} \in \arg\min_{\xi}\ \mathcal{W}_p(\mathcal{P}, \xi) + \mathcal{W}_p(\xi, \xi_k^{(r)})$ and $\xi_{\mathcal{Q}_k}^{(r)} \in \arg\min_{\xi}\ \mathcal{W}_p(\mathcal{Q}_k, \xi) + \mathcal{W}_p(\xi, \xi_k^{(r)})$.
2. Communication. The interpolant $\xi_{\mathcal{P}}^{(r)}$ is transmitted by the party holding $\mathcal{P}$ to the party holding $\mathcal{Q}_k$.
3. Update. The updated proxy measure $\xi_k^{(r+1)}$ is calculated via an interpolation measure between $\xi_{\mathcal{P}}^{(r)}$ and $\xi_{\mathcal{Q}_k}^{(r)}$ and is subsequently returned to the synthetic data party.
###### Theorem 4 (Monotonicity and Interpolation Convergence, Rakotomamonjy et al. ([2024](https://arxiv.org/html/2606.13732#bib.bib29))).
For interpolation $\xi_k^{(r)}$ at iteration $r$, define the sequence as:
$$\mathcal{E}_k^{(r)} = \mathcal{W}_p(\mathcal{Q}_k, \xi_k^{(r)}) + \mathcal{W}_p(\xi_k^{(r)}, \mathcal{P}). \tag{23}$$
Then, the sequence $\{\mathcal{E}_k^{(r)}\}_{r \geq 0}$ is non-increasing and converges to its infimum $\mathcal{E}_k^{*} = \mathcal{W}_p(\mathcal{Q}_k, \mathcal{P})$.
Please refer to [Subsection B.7](https://arxiv.org/html/2606.13732#A2.SS7) for the complete proof.
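The interpolation, communication and update steps, together with the monotonicity claim of Theorem 4, can be sketched in a toy setting. The assumptions here are not from the paper: samples are one-dimensional and equal-size (so sorted matching gives the optimal map), every interpolation uses the midpoint $t = 0.5$, and $p = 2$. All function names are hypothetical.

```python
def w_p(xs, ys, p=2):
    """W_p between equal-size, uniformly weighted 1-D samples (sorted matching)."""
    pairs = zip(sorted(xs), sorted(ys))
    return (sum(abs(a - b) ** p for a, b in pairs) / len(xs)) ** (1.0 / p)


def interpolate(xs, ys, t):
    """Point at time t on the 1-D geodesic from sample xs to sample ys."""
    return [(1 - t) * a + t * b for a, b in zip(sorted(xs), sorted(ys))]


def collaborative_proxy(P, Q_k, xi0, rounds=30, t=0.5):
    """Run the three-step loop and record E_k^(r) = W(Q_k, xi) + W(xi, P)."""
    xi = list(xi0)
    history = [w_p(Q_k, xi) + w_p(xi, P)]
    for _ in range(rounds):
        xi_P = interpolate(P, xi, t)     # step 1, computed where P lives
        xi_Q = interpolate(Q_k, xi, t)   # step 1, computed where Q_k lives
        # Step 2: only xi_P is sent to party k; P itself never leaves its holder.
        xi = interpolate(xi_P, xi_Q, t)  # step 3, updated proxy
        history.append(w_p(Q_k, xi) + w_p(xi, P))
    return xi, history


P = [0.0, 1.0, 4.0]      # toy synthetic sample
Q_k = [2.0, 5.0, 9.0]    # toy private reference sample
xi_star, history = collaborative_proxy(P, Q_k, xi0=[-3.0, 7.0, 20.0])
```

In this toy case the proxy settles on the midpoint of the geodesic between the two samples, and the recorded sequence decreases toward $\mathcal{W}_2(\mathcal{Q}_k, \mathcal{P})$ even though the initial proxy is far from both.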
Figure 2: Collaborative Geodesic Interpolation Calculation. Holders of synthetic data $\mathcal{P}$ and real data $\mathcal{Q}_k$ jointly compute the proxy measure $\xi_k^{*}$ along the geodesic without raw data exchange. $\xi_{\mathcal{P}}^{(r)}$ and $\xi_{\mathcal{Q}_k}^{(r)}$ denote intermediate interpolants between the proxy measure $\xi_k^{(r)}$ and the distributions $\mathcal{P}$ and $\mathcal{Q}_k$, respectively.

Figure 3: Collaborative Wasserstein Barycenter Estimation. (Left) Schematic illustration of the iterative interpolation process for computing the approximate barycenter $\xi^{(r)}$. (Middle) Empirical visualization of the optimization trajectory, demonstrating the convergence from an arbitrary initialization to the barycenter. (Right) Numerical convergence results showing the monotonic decrease.
The above process is fully consistent with the process in Rakotomamonjy et al. ([2024](https://arxiv.org/html/2606.13732#bib.bib29)) for computing the Wasserstein distance, while we use $\xi_k^{*}$ as a proxy signal of $\mathcal{Q}_k$ for data selection. Specifically, let the candidate synthetic dataset be represented as an empirical measure $\mathcal{P} = \frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}$. For each candidate sample $x_i$, each party $k$ contributes a score $\mathcal{S}_k(x_i)$; colloquially, every party scores $x_i$. We select a subset of indices $\mathcal{I}$ with cardinality constraint $|\mathcal{I}| \leq n < N$, yielding the filtered distribution $\mathcal{P}_{\mathcal{I}} = \frac{1}{n}\sum_{i \in \mathcal{I}}\delta_{x_i}$.
To ensure $\mathcal{P}_{\mathcal{I}}$ covers the distributional modes of all parties while avoiding redundancy, we draw inspiration from recommendation systems that maximize the coverage of diverse user interests (Krause and Golovin, [2014](https://arxiv.org/html/2606.13732#bib.bib56)) and mitigate social bias in synthetic data (Mehrotra and Vishnoi, [2023](https://arxiv.org/html/2606.13732#bib.bib59)). We formulate the selection as a monotone submodular maximization problem by modeling the diminishing return of adding similar local synthetic data for each individual party. Colloquially, this means that multiple parties score synthetic data, instead of a single biased data silo.
$$\underset{\mathcal{I} \subseteq \{1, \dots, N\}}{\text{maximize}} \quad \sum_{k=1}^{K} g\Big(\sum_{i \in \mathcal{I}}\big(1 - \tilde{\mathcal{S}}_k(x_i)\big)\Big) \quad \text{s.t.}\ \ |\mathcal{I}| \leq n, \tag{24}$$
where $\tilde{\mathcal{S}}_k(x_i)$ denotes the normalized score (min-max normalization) and $g: \mathbb{R}_{\geq 0} \to \mathbb{R}$ is a non-decreasing concave function (e.g., $g(z) = \log(1 + z)$) that penalizes redundancy. Since the composition of a concave function with a non-negative modular sum is guaranteed to be submodular (Bach, [2013](https://arxiv.org/html/2606.13732#bib.bib58)), this problem can be solved using a standard greedy algorithm, which provides a $(1 - 1/e)$-approximation guarantee relative to the optimum (Nemhauser et al., [1978](https://arxiv.org/html/2606.13732#bib.bib57)).
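The greedy procedure for Eq. (24) can be sketched as follows, with $g(z) = \log(1 + z)$ as in the example above. The score matrix is made up for illustration, the function name `greedy_select` is hypothetical, and this is the plain greedy loop rather than an accelerated (lazy) variant.

```python
import math


def greedy_select(norm_scores, budget):
    """Greedily maximise sum_k log(1 + sum_{i in I} (1 - S~_k(x_i))).

    norm_scores[k][i] is party k's min-max normalised score for candidate i.
    """
    num_parties = len(norm_scores)
    num_candidates = len(norm_scores[0])
    covered = [0.0] * num_parties  # running modular sum per party
    selected = []
    remaining = set(range(num_candidates))

    def objective(sums):
        return sum(math.log1p(s) for s in sums)

    for _ in range(min(budget, num_candidates)):
        base = objective(covered)
        best_i, best_gain = None, -1.0
        for i in sorted(remaining):
            trial = [covered[k] + (1.0 - norm_scores[k][i])
                     for k in range(num_parties)]
            gain = objective(trial) - base  # marginal gain of adding candidate i
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
        remaining.remove(best_i)
        for k in range(num_parties):
            covered[k] += 1.0 - norm_scores[k][best_i]
    return selected


# Two parties, four candidates: party 0 favours candidates 0 and 1,
# party 1 favours candidate 2 (lower normalised score = better aligned).
scores = [[0.0, 0.1, 1.0, 0.9],
          [1.0, 0.9, 0.0, 0.8]]
chosen = greedy_select(scores, budget=2)
```

With these toy scores, ranking by party 0 alone would keep candidates 0 and 1, whereas the concave aggregation picks one candidate favoured by each party, which is the coverage behaviour the formulation is designed to produce.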
However, this approach is limited in scalability, as any modification to the synthetic dataset $\mathcal{P}$ necessitates a complete re-computation of the interpolation. Therefore, we propose an offline framework using a Wasserstein barycenter.
### 4.4 Collaborative Wasserstein Barycenter Estimation
We now present Scheme II. [Theorem 3](https://arxiv.org/html/2606.13732#Thmtheorem3) suggests that if the filtered distribution approximates the ground truth, we can theoretically mitigate mode collapse. To achieve this, we compute a proxy for the ground truth distribution, i.e., the Wasserstein barycenter ([Property 1](https://arxiv.org/html/2606.13732#Thmproperty1)). Similar to [Subsection 4.3](https://arxiv.org/html/2606.13732#S4.SS3), we abstract the process as the following geometric procedure illustrated in [Figure 3](https://arxiv.org/html/2606.13732#S4.F3).
Initialization. A central server initializes an estimated barycenter $\xi^{(0)}$ with multiple parties holding real data distributions $\mathcal{Q}_k$ for $k = 0, 1, 2$. For round $r = 0, \ldots, R-1$:
1. Interpolation. The server broadcasts the current estimate $\xi^{(r)}$. Each party $k$ computes the geodesic interpolation $\xi_k^{(r)}$ between $\xi^{(r)}$ and its local distribution $\mathcal{Q}_k$, i.e., $\xi_k^{(r)} \in \arg\min_{\xi}\ \mathcal{W}_p(\mathcal{Q}_k, \xi) + \mathcal{W}_p(\xi, \xi^{(r)})$.
2. Communication. The server aggregates the interpolations $\{\xi_k^{(r)}\}_k$ from all participating parties.
3. Update. The server updates the estimated barycenter $\xi^{(r+1)} = \sum_{k=1}^{K}\frac{1}{K}\,\xi_k^{(r)}$.
###### Theorem 5 (Monotonicity and Barycenter Convergence).
Let $\mathcal{Q}_k$ be the distribution of the $k$-th client with weight $\lambda_k > 0$. Let $\xi^{(r)}$ denote the global approximated barycenter at iteration $r$. Define the true Fréchet variance (objective sequence) as:
$$\mathcal{E}^{(r)} = \sum_{k=1}^{K}\lambda_k\,\mathcal{W}_2^{2}(\mathcal{Q}_k, \xi^{(r)}). \tag{25}$$
Then, the sequence $\{\mathcal{E}^{(r)}\}_{r \geq 0}$ is non-increasing and converges to its infimum $\mathcal{E}^{*} = \sum_{k=1}^{K}\lambda_k\,\mathcal{W}_2^{2}(\mathcal{Q}_k, \xi^{*})$.
Please refer to [Subsection B.8](https://arxiv.org/html/2606.13732#A2.SS8) for the complete proof. Empirically, [Figure 3](https://arxiv.org/html/2606.13732#S4.F3) shows that our method yields an approximation that is virtually indistinguishable from the true Wasserstein barycenter computed by the standard free-support algorithm (Cuturi and Doucet, [2014](https://arxiv.org/html/2606.13732#bib.bib60)).
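The server loop and the monotone decrease of Eq. (25) can be illustrated in one dimension. This is a simplified sketch under assumptions that are not taken from the paper: samples are 1-D and equal-size, weights are uniform, each party interpolates at $t = 0.5$, and the update step is interpreted as averaging matched (sorted) support points rather than forming a mixture of measures. In 1-D that averaging yields the $\mathcal{W}_2$ barycenter of the interpolants; the authors' implementation may differ. All names are hypothetical.

```python
def frechet_variance(parties, xi):
    """Uniformly weighted sum of squared 1-D W_2 distances from each party to xi."""
    total = 0.0
    for q in parties:
        pairs = zip(sorted(q), sorted(xi))
        total += sum((a - b) ** 2 for a, b in pairs) / len(xi)
    return total / len(parties)


def estimate_barycenter(parties, xi0, rounds=40, t=0.5):
    xi = sorted(xi0)
    history = [frechet_variance(parties, xi)]
    for _ in range(rounds):
        # Step 1: each party moves the broadcast estimate toward its own data.
        local = [[(1 - t) * x + t * q for x, q in zip(xi, sorted(q_k))]
                 for q_k in parties]
        # Steps 2-3: the server averages the matched support points.
        xi = [sum(col) / len(parties) for col in zip(*local)]
        history.append(frechet_variance(parties, xi))
    return xi, history


parties = [[0.0, 1.0, 2.0], [4.0, 5.0, 9.0], [8.0, 9.0, 13.0]]
xi_hat, history = estimate_barycenter(parties, xi0=[-5.0, 0.0, 30.0])
```

In this toy setting the estimate converges to the quantile average of the three samples, which is the 1-D $\mathcal{W}_2$ barycenter, and only interpolants (never the raw samples) reach the server.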
By utilizing the calibrated gradient $\mathcal{S}(x_i) = \frac{\partial\,\mathbb{W}_p(\mathcal{P}, \xi^{*})}{\partial\,\mathcal{P}(x_i)} = f^{*}(x_i) - \frac{1}{N-1}\sum_{j \neq i}^{N} f^{*}(x_j)$, we identify top-$\alpha$ samples aligned with the global ground truth. Crucially, this decouples the proxy estimation from the synthetic data generation, allowing $\xi^{*}$ to be reused even as synthetic data $\mathcal{P}$ changes.
Figure 4: Computational Time Scalability Analysis. Wall-clock time for one round of data selection under varying candidate set size $N$, reference set size per client $M$, and number of clients $K$.
## 5 Computational Complexity
We further evaluate the scalability of the proposed selection mechanisms by measuring the wall-clock time required for one round of data selection. CIFAR-10 is used as the base dataset, and the candidate pool is enlarged to $N = 200{,}000$ by replicating the original 50,000 images. We compare two variants: Interpolation Proxy ([Algorithm 1](https://arxiv.org/html/2606.13732#alg1)) and Barycenter Proxy ([Algorithm 2](https://arxiv.org/html/2606.13732#alg2)). The benchmark follows a parallel simulation protocol, where client-side computations are assumed to run concurrently, matching the federated setting.
Unless otherwise specified, the default configuration is $N = 200{,}000$ candidate samples, $M = 5{,}000$ reference samples per client, $K = 10$ clients, and $R = 10$ interpolation rounds. For each experiment, we vary one parameter while keeping the remaining parameters fixed. The reported runtime includes GPU data transfer, distance matrix computation, and selection or transport operations, while excluding initial data loading and communication overhead.
We now summarize the leading computational cost. Let $N = |\mathcal{P}|$ denote the size of the synthetic candidate pool, $M = |\mathcal{Q}_k|$ the size of each local real dataset, $S$ the support size of the proxy distribution, $L$ the number of Sinkhorn scaling iterations, $R$ the number of interpolation rounds in Scheme I, $T$ the number of barycenter estimation rounds in Scheme II, and $n$ the selection budget. The complexity is reported as parallel wall-clock complexity.
###### Theorem 6 (Computational Complexity).
Under Sinkhorn-based OT with $L$ scaling iterations, the leading parallel wall-clock complexities of our schemes are
$$\begin{aligned}\mathcal{T}_{\texttt{Scheme I}} &= \mathcal{O}\!\left(RL(N + M + S)S + nNK\right),\\ \mathcal{T}_{\texttt{Scheme II}} &= \mathcal{O}\!\left(TLMS + LNS\right).\end{aligned} \tag{26}$$
The first term in $\mathcal{T}_{\texttt{Scheme I}}$ comes from the collaborative geodesic interpolation between the synthetic distribution $\mathcal{P}$ and each local real distribution $\mathcal{Q}_k$. Since this proxy depends on the current candidate pool, a new synthetic pool generally requires recomputing the interpolation. The second term corresponds to greedy sample selection, where marginal gains are evaluated across $K$ client-side scores. By contrast, Scheme II separates proxy estimation from candidate selection. Its barycenter estimation cost, $TLMS$, depends on the local real distributions and the barycenter support size, but not on the synthetic candidate size $N$. Once the barycenter proxy is obtained, scoring a new synthetic pool requires only a single Sinkhorn-based pass of cost $LNS$. This decoupling is important in iterative synthetic data generation: when $\mathcal{P}$ changes, Scheme II can reuse the barycenter proxy, whereas Scheme I must recompute the interpolation proxy.
As shown in [Figure 4](https://arxiv.org/html/2606.13732#S4.F4), both methods scale approximately linearly with $N$ and $M$, consistent with [Theorem 6](https://arxiv.org/html/2606.13732#Thmtheorem6). However, their behavior differs more clearly as $K$ increases. Scheme I becomes more expensive because it repeatedly interacts with client-specific interpolation proxies and aggregates client-wise selection scores. In contrast, Scheme II remains nearly flat in the parallel setting, since the barycenter estimation is decoupled from the synthetic candidate pool and can be amortized across multiple rounds of synthetic data generation. This explains why the empirical gap in wall-clock time is larger than what a single-round comparison alone suggests.
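The leading terms in Eq. (26) can be evaluated directly to see how the two schemes compare and what reuse of the barycenter proxy saves. The sketch drops the constants hidden by $\mathcal{O}(\cdot)$, so the outputs are unitless operation counts rather than runtimes. $N$, $M$, $K$ and $R$ follow the stated default configuration; the values of $S$, $L$, $T$ and $n$ are placeholders chosen for illustration, not figures from the paper.

```python
def scheme_costs(N, M, S, L, R, T, n, K):
    """Leading-order terms of Eq. (26), with O(.) constants dropped."""
    scheme_1 = R * L * (N + M + S) * S + n * N * K
    scheme_2 = T * L * M * S + L * N * S
    return scheme_1, scheme_2


def scheme_2_rescoring_cost(N, S, L):
    """Cost of scoring a new candidate pool once the barycenter proxy exists."""
    return L * N * S


# N, M, K, R: default benchmark configuration.
# S, L, T, n: hypothetical placeholder values.
defaults = dict(N=200_000, M=5_000, S=1_000, L=100, R=10, T=10, n=50_000, K=10)
cost_1, cost_2 = scheme_costs(**defaults)
reuse_cost = scheme_2_rescoring_cost(defaults["N"], defaults["S"], defaults["L"])
```

Because the barycenter term $TLMS$ does not involve $N$, only `reuse_cost` is paid again when the synthetic pool changes, whereas both terms of Scheme I are incurred for every new pool.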


Figure 5: Evolution of the proportion of the Airplane class and the nine other classes using the Airplane class as a local reference dataset (Left). Evolution of FID ($\downarrow$) of the generated images using ExDir$(1, 0.1)$ (Middle) and IID partition (Right) as local reference datasets.

Table 1: Results after 10 iterations with ExDir$(1, 0.1)$ reference set. Colors indicate the 1st, 2nd, and 3rd best results.

| Method | CIFAR-10 FID ↓ | CIFAR-10 Precision ↑ | CIFAR-10 Recall ↑ | STL-10 FID ↓ | STL-10 Precision ↑ | STL-10 Recall ↑ | CelebA FID ↓ | CelebA Precision ↑ | CelebA Recall ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random | 106 | 0.53 | 0.48 | 95 | 0.49 | 0.53 | 96 | 0.51 | 0.28 |
| K-means (Lin et al., [2023](https://arxiv.org/html/2606.13732#bib.bib62)) | 102 | 0.56 | 0.40 | 89 | 0.54 | 0.53 | 87 | 0.59 | 0.48 |
| CenterMatch (He et al., [2023](https://arxiv.org/html/2606.13732#bib.bib61)) | 116 | 0.50 | 0.35 | 111 | 0.57 | 0.58 | 87 | 0.64 | 0.46 |
| CovMatch (Rezaei et al., [2026](https://arxiv.org/html/2606.13732#bib.bib6)) | 115 | 0.51 | 0.47 | 131 | 0.59 | 0.55 | 92 | 0.65 | 0.51 |
| Scheme II ([Subsection 4.4](https://arxiv.org/html/2606.13732#S4.SS4)) | 85 | 0.57 | 0.57 | 69 | 0.57 | 0.63 | 75 | 0.70 | 0.62 |
| Scheme I ([Subsection 4.3](https://arxiv.org/html/2606.13732#S4.SS3)) | 71 | 0.60 | 0.58 | 65 | 0.66 | 0.71 | 69 | 0.69 | 0.71 |
## 6 Experiments
We follow the settings (baselines, configurations, and metrics) in Shidani et al. ([2025](https://arxiv.org/html/2606.13732#bib.bib15)); Rezaei et al. ([2026](https://arxiv.org/html/2606.13732#bib.bib6)); Bertrand et al. ([2024](https://arxiv.org/html/2606.13732#bib.bib11)). Due to space constraints, detailed settings and supplementary results are deferred to [Appendix C](https://arxiv.org/html/2606.13732#A3).
Baselines. Following Rezaei et al. ([2026](https://arxiv.org/html/2606.13732#bib.bib6)), CovMatch (Rezaei et al., [2026](https://arxiv.org/html/2606.13732#bib.bib6)) uses a greedy strategy to match the feature covariance of real data; CenterMatch (He et al., [2023](https://arxiv.org/html/2606.13732#bib.bib61)) selects images closest to the centroid of real features; and K-means (Lin et al., [2023](https://arxiv.org/html/2606.13732#bib.bib62)) clusters the generated pool into groups and selects images closest to each cluster center.
Metrics. Following Bertrand et al. ([2024](https://arxiv.org/html/2606.13732#bib.bib11)), we report Fréchet Inception Distance (FID) (Heusel et al., [2017](https://arxiv.org/html/2606.13732#bib.bib64)) for quality, as well as Precision for fidelity and Recall for diversity based on Inception-V3 (Kynkäänniemi et al., [2019](https://arxiv.org/html/2606.13732#bib.bib63)).
Setup. Following Shi et al. ([2025](https://arxiv.org/html/2606.13732#bib.bib5)), we conduct experiments with DDPM (Ho et al., [2020](https://arxiv.org/html/2606.13732#bib.bib71)) across CIFAR-10 (Krizhevsky et al., [2009](https://arxiv.org/html/2606.13732#bib.bib67)), CelebA (Liu et al., [2015](https://arxiv.org/html/2606.13732#bib.bib68)), and STL-10 (Coates et al., [2011](https://arxiv.org/html/2606.13732#bib.bib74)). In the main text, we primarily investigate the Accumulate-Subsample paradigm: we select $n$ instances from $N = 4n$ generated candidates to augment the data pool, and subsequently subsample $n$ instances for training from the pool. Since Dirichlet partitions yield weak initial models, we first utilize the training set of size $n = 50{,}000$ to establish a strong generator. Note that the training set is not added to the accumulated pool; instead, the data is distributed among 10 parties to serve as local reference sets. In the specific case of STL-10, we allocate the 5,000 labeled samples across the 10 parties.
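The Accumulate-Subsample loop described above can be sketched schematically. The callables `generate`, `select` and `train` are hypothetical stand-ins for the generator, the selection mechanism and the training routine; whether the model is retrained from scratch or fine-tuned in each iteration is left to `train` and is not specified here. The toy usage replaces images with floats purely to make the loop executable.

```python
import random


def accumulate_subsample(generate, select, train, model, n, iterations, seed=0):
    """Schematic Accumulate-Subsample loop with N = 4n candidates per iteration."""
    rng = random.Random(seed)
    pool = []  # accumulated synthetic pool; the real training set is never added
    for _ in range(iterations):
        candidates = generate(model, 4 * n)  # N = 4n generated candidates
        pool.extend(select(candidates, n))   # keep n of them and accumulate
        train_set = rng.sample(pool, n)      # subsample n instances for training
        model = train(model, train_set)
    return model, pool


# Toy stand-ins: the "model" is a single float and samples are floats.
def toy_generate(model, count):
    return [model + 0.1 * i for i in range(count)]


def toy_select(candidates, n):
    # Keep the n candidates closest to a reference value of 0.
    return sorted(candidates, key=abs)[:n]


def toy_train(model, data):
    return sum(data) / len(data)


final_model, pool = accumulate_subsample(
    toy_generate, toy_select, toy_train, model=0.0, n=5, iterations=3)
```

The pool grows by $n$ samples per iteration while the training set size stays fixed at $n$, so the selection rule determines which regions of the candidate distribution can ever enter training.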
[Figure 5](https://arxiv.org/html/2606.13732#S5.F5) (Left) illustrates the proportion of each class in the training set predicted by a pre-trained VGG11 (92.39% test accuracy) when only the Airplane class is used as the reference dataset for CenterMatch. The results highlight that when the selection mechanism is guided by a skewed local prior, the diversity deteriorates rapidly, leading to homogenization. [Figure 5](https://arxiv.org/html/2606.13732#S5.F5) (Middle & Right) illustrates that while selection baselines mitigate collapse in IID scenarios, they surprisingly lag behind Random in non-IID scenarios. [Table 1](https://arxiv.org/html/2606.13732#S5.T1) shows that baselines with poor performance on CIFAR-10 and STL-10 achieve better results on face generation. Intuitively, this stems from the highly structured nature of face data: even with a biased reference set, the filtered images still retain the basic characteristics.
## 7 Lessons Learned
Sample Selection Bias Accelerates Collapse in Data Silos. Although local filtering is introduced to mitigate model collapse, optimizing synthetic data solely against local reference signals can have the opposite effect. In siloed environments, the local real-data distribution provides only a partial view of the target distribution. As a result, filtering synthetic samples by their agreement with local ground truth induces a confirmation-bias effect: samples that deviate from the local reference, including valid but underrepresented modes, are more likely to be discarded. This gradually narrows the synthetic training distribution and accelerates diversity loss across recursive generations.
Low-Resource Regimes Are Especially Vulnerable. This failure mode becomes more pronounced when real-data coverage is scarce or fragmented. In low-resource regimes, tail regions are weakly represented even before synthetic augmentation begins, so local-reference selection can easily confuse rare but valid modes with low-quality generations. The filtering process therefore does not merely remove noise; it systematically suppresses underrepresented regions of the target distribution. Data scarcity is thus amplified into persistent tail pruning, making low-resource silos particularly susceptible to collapse.
## Impact Statement
##### Advancing the Sustainability of Generative AI\.
This research addresses a critical existential risk facing the future of artificial intelligence: model collapse driven by recursive training on synthetic data\. As the digital ecosystem becomes saturated with machine\-generated content, future models risk degenerating into homogenized representations of reality that lose statistical fidelity\. By theoretically quantifying the diversity \(variance\) decay rate as a power law and providing a geometric solution to mitigate it, this work offers a pathway to sustain the self\-improvement of Generative AI systems without necessitating a continuous influx of human\-generated data, which is becoming a scarce resource\.
##### Ethical Considerations and Limitations\.
There remains a risk that if the collaborative proxy is poisoned or biased by a majority of participating nodes, the selection mechanism could enforce a collective bias rather than a true ground truth\. Therefore, deployment of these methods requires careful governance to ensure the participating entities in a siloed network represent a sufficiently diverse cross\-section of the target domain\. Although the algorithmic dynamics regarding adversarial attacks and defenses are intriguing, this setting exceeds the scope of this study, which is dedicated to investigating model collapse within data silos and taking the first step toward its resolution by shifting from collaborative learning to collaborative evaluation\.
## Acknowledgements
This work was supported in part by the Chongqing Key Laboratory of Trusted Perception and Interaction Technology for Intelligent and Connected Vehicles, the State Key Laboratory of Intelligent Vehicle Safety Technology, Chongqing Changan Automobile Co\., Ltd\., and the Chongqing Natural Science Foundation \(Grant No\. CSTB2024NSCQ\-LZX0172\), and in part by the National Natural Science Foundation of China \(Grant No\. 62572433\)\.
## References
- M. Agueh and G. Carlier (2011) Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis 43(2), pp. 904–924. Cited by: [Property 1](https://arxiv.org/html/2606.13732#Thmproperty1).
- S. Alemohammad, J. Casco-Rodriguez, L. Luzi, A. I. Humayun, H. Babaei, D. LeJeune, A. Siahkoohi, and R. Baraniuk (2024) Self-consuming generative models go MAD. In The Twelfth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=ShjMHfmPs0). Cited by: [§A.1](https://arxiv.org/html/2606.13732#A1.SS1.p2.1), [§A.2](https://arxiv.org/html/2606.13732#A1.SS2.p2.1), [§1](https://arxiv.org/html/2606.13732#S1.p1.1), [1st item](https://arxiv.org/html/2606.13732#S2.I1.i1.p1.1), [§2](https://arxiv.org/html/2606.13732#S2.SS0.SSS0.Px1.p1.4), [§2](https://arxiv.org/html/2606.13732#S2.SS0.SSS0.Px2.p1.1).
- D. Alvarez-Melis and N. Fusi (2020) Geometric dataset distances via optimal transport. Advances in Neural Information Processing Systems 33, pp. 21428–21439. Cited by: [§C.1](https://arxiv.org/html/2606.13732#A3.SS1.SSS0.Px1.p1.1).
- L. Ambrosio, N. Gigli, G. Savare, et al. (2005) Gradient flows in metric spaces and in the space of probability measures. Cited by: [Property 3](https://arxiv.org/html/2606.13732#Thmproperty3).
- F. Bach (2013) Learning with submodular functions: a convex optimization perspective. Foundations and Trends® in Machine Learning 6(2-3), pp. 145–373. Cited by: [§4.3](https://arxiv.org/html/2606.13732#S4.SS3.p6.5).
- Q. Bertrand, J. Bose, A. Duplessis, M. Jiralerspong, and G. Gidel (2024) On the stability of iterative retraining of generative models on their own data. In The Twelfth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=JORAfH2xFd). Cited by: [§1](https://arxiv.org/html/2606.13732#S1.p1.1), [§2](https://arxiv.org/html/2606.13732#S2.SS0.SSS0.Px2.p1.1), [§6](https://arxiv.org/html/2606.13732#S6.p1.1), [§6](https://arxiv.org/html/2606.13732#S6.p3.1).
- D. P. Bertsekas (1997) Nonlinear programming. Journal of the Operational Research Society 48(3), pp. 334–334. Cited by: [§4.1](https://arxiv.org/html/2606.13732#S4.SS1.p1.4).
- D\. Chen, N\. Yu, Y\. Zhang, and M\. Fritz \(2020\)GAN\-leaks: a taxonomy of membership inference attacks against generative models\.InProceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security,CCS ’20,New York, NY, USA,pp\. 343–362\.External Links:[Link](https://doi.org/10.1145/3372297.3417238)Cited by:[§4\.1](https://arxiv.org/html/2606.13732#S4.SS1.p4.4)\.
- T\. Chen, Y\. Hirota, M\. Otani, N\. Garcia, and Y\. Nakashima \(2024\)Would deep generative models amplify bias in future models?\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 10833–10843\.Cited by:[§3\.3](https://arxiv.org/html/2606.13732#S3.SS3.p2.8)\.
- A\. Coates, A\. Y\. Ng, and H\. Lee \(2011\)An analysis of single\-layer networks in unsupervised feature learning\.InProceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics,pp\. 215–223\.External Links:[Link](http://proceedings.mlr.press/v15/coates11a.html)Cited by:[§6](https://arxiv.org/html/2606.13732#S6.p4.4)\.
- N\. Courty, R\. Flamary, and M\. Ducoffe \(2018\)Learning wasserstein embeddings\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=SJyEH91A-)Cited by:[Property 2](https://arxiv.org/html/2606.13732#Thmproperty2.p1.10.4)\.
- N\. Courty, R\. Flamary, A\. Habrard, and A\. Rakotomamonjy \(2017\)Joint distribution optimal transportation for domain adaptation\.InAdvances in Neural Information Processing Systems,I\. Guyon, U\. V\. Luxburg, S\. Bengio, H\. Wallach, R\. Fergus, S\. Vishwanathan, and R\. Garnett \(Eds\.\),Vol\.30,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2017/file/0070d23b06b1486a538c0eaa45dd167a-Paper.pdf)Cited by:[§3\.3](https://arxiv.org/html/2606.13732#S3.SS3.p1.1)\.
- M\. Cuturi and A\. Doucet \(2014\)Fast computation of wasserstein barycenters\.InInternational conference on machine learning,pp\. 685–693\.Cited by:[§4\.4](https://arxiv.org/html/2606.13732#S4.SS4.p2.2)\.
- M\. Cuturi \(2013\)Sinkhorn distances: lightspeed computation of optimal transport\.Advances in neural information processing systems26\.Cited by:[§A\.3](https://arxiv.org/html/2606.13732#A1.SS3.p1.1)\.
- A\. Dey and D\. Donoho \(2024\)Universality of the $\pi^{2}/6$ pathway in avoiding model collapse\.arXiv preprint arXiv:2410\.22812\.Cited by:[2nd item](https://arxiv.org/html/2606.13732#S2.I1.i2.p1.2),[3rd item](https://arxiv.org/html/2606.13732#S2.I1.i3.p1.1)\.
- E\. Dohmatob, Y\. Feng, and J\. Kempe \(2024\)Model collapse demystified: the case of regression\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=bioHNTRnQk)Cited by:[§A\.1](https://arxiv.org/html/2606.13732#A1.SS1.p1.1),[§A\.1](https://arxiv.org/html/2606.13732#A1.SS1.p3.1)\.
- E\. Dohmatob, Y\. Feng, A\. Subramonian, and J\. Kempe \(2025a\)Strong model collapse\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=et5l9qPUhm)Cited by:[§1](https://arxiv.org/html/2606.13732#S1.p2.1)\.
- E\. Dohmatob, M\. Pezeshki, and R\. Askari\-Hemmat \(2025b\)Why less is more \(sometimes\): a theory of data curation\.arXiv preprint arXiv:2511\.03492\.Cited by:[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p1.1)\.
- P\. Dvurechenskii, D\. Dvinskikh, A\. Gasnikov, C\. Uribe, and A\. Nedich \(2018\)Decentralize and randomize: faster algorithm for wasserstein barycenters\.Advances in Neural Information Processing Systems31\.Cited by:[§A\.3](https://arxiv.org/html/2606.13732#A1.SS3.p1.1)\.
- Y\. Feng, E\. Dohmatob, P\. Yang, F\. Charton, and J\. Kempe \(2025\)Beyond model collapse: scaling up with synthesized data requires verification\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p4.1),[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p5.1),[§C\.5](https://arxiv.org/html/2606.13732#A3.SS5.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2606.13732#S1.p2.1),[§3\.1](https://arxiv.org/html/2606.13732#S3.SS1.p1.2)\.
- D\. Ferbach, Q\. Bertrand, J\. Bose, and G\. Gidel \(2024\)Self\-consuming generative models with curated data provably optimize human preferences\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=cyv0LkIaoH)Cited by:[§1](https://arxiv.org/html/2606.13732#S1.p3.1),[§1](https://arxiv.org/html/2606.13732#S1.p5.1),[§1](https://arxiv.org/html/2606.13732#S1.p7.1),[§3\.1](https://arxiv.org/html/2606.13732#S3.SS1.p6.2)\.
- S\. Fu, Y\. Wang, Y\. Chen, L\. Shen, and D\. Tao \(2025\)Self\-verification provably prevents model collapse in recursive synthetic training\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=X5Hk8aMs6w)Cited by:[§1](https://arxiv.org/html/2606.13732#S1.p2.1)\.
- G\. Ganev and E\. De Cristofaro \(2025\)The inadequacy of similarity\-based privacy metrics: privacy attacks against “truly anonymous” synthetic datasets\.In2025 IEEE Symposium on Security and Privacy \(SP\),pp\. 4007–4025\.Cited by:[§4\.1](https://arxiv.org/html/2606.13732#S4.SS1.p4.4)\.
- M\. Gerstgrasser, R\. Schaeffer, A\. Dey, R\. Rafailov, T\. Korbak, H\. Sleight, R\. Agrawal, J\. Hughes, D\. B\. Pai, A\. Gromov, D\. Roberts, D\. Yang, D\. L\. Donoho, and S\. Koyejo \(2024\)Is model collapse inevitable? breaking the curse of recursion by accumulating real and synthetic data\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=5B2K4LRgmz)Cited by:[§A\.1](https://arxiv.org/html/2606.13732#A1.SS1.p3.1)\.
- A\. Grover, J\. Song, A\. Kapoor, K\. Tran, A\. Agarwal, E\. J\. Horvitz, and S\. Ermon \(2019\)Bias correction of learned generative models using likelihood\-free importance weighting\.InAdvances in Neural Information Processing Systems,Vol\.32\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2019/file/d76d8deea9c19cc9aaf2237d2bf2f785-Paper.pdf)Cited by:[§3\.3](https://arxiv.org/html/2606.13732#S3.SS3.p2.8)\.
- L\. Gui, C\. Garbacea, and V\. Veitch \(2024\)BoNBon alignment for large language models and the sweetness of best\-of\-n sampling\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=haSKMlrbX5)Cited by:[§3\.1](https://arxiv.org/html/2606.13732#S3.SS1.p1.2)\.
- T\. Hasan, A\. Bhattacharjee, M\. S\. Islam, K\. Mubasshir, Y\. Li, Y\. Kang, M\. S\. Rahman, and R\. Shahriyar \(2021\)XL\-sum: large\-scale multilingual abstractive summarization for 44 languages\.InFindings of the Association for Computational Linguistics: ACL\-IJCNLP 2021,pp\. 4693–4703\.Cited by:[§C\.5](https://arxiv.org/html/2606.13732#A3.SS5.SSS0.Px2.p1.1)\.
- R\. Hataya, H\. Bao, and H\. Arai \(2023\)Will Large\-scale Generative Models Corrupt Future Datasets?\.In2023 IEEE/CVF International Conference on Computer Vision \(ICCV\),Los Alamitos, CA, USA,pp\. 20498–20508\.External Links:[Document](https://dx.doi.org/10.1109/ICCV51070.2023.01879),[Link](https://doi.ieeecomputersociety.org/10.1109/ICCV51070.2023.01879)Cited by:[§A\.1](https://arxiv.org/html/2606.13732#A1.SS1.p2.1)\.
- R\. He, S\. Sun, X\. Yu, C\. Xue, W\. Zhang, P\. Torr, S\. Bai, and X\. QI \(2023\)IS SYNTHETIC DATA FROM GENERATIVE MODELS READY FOR IMAGE RECOGNITION?\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=nUmCcZ5RKF)Cited by:[2nd item](https://arxiv.org/html/2606.13732#A3.I1.i2.p1.1),[§3\.1](https://arxiv.org/html/2606.13732#S3.SS1.p1.2),[Table 1](https://arxiv.org/html/2606.13732#S5.T1.14.12.1),[§6](https://arxiv.org/html/2606.13732#S6.p2.3)\.
- M\. Heusel, H\. Ramsauer, T\. Unterthiner, B\. Nessler, and S\. Hochreiter \(2017\)Gans trained by a two time\-scale update rule converge to a local nash equilibrium\.Advances in neural information processing systems30\.Cited by:[§6](https://arxiv.org/html/2606.13732#S6.p3.1)\.
- J\. Ho, A\. Jain, and P\. Abbeel \(2020\)Denoising diffusion probabilistic models\.Advances in neural information processing systems33,pp\. 6840–6851\.Cited by:[2nd item](https://arxiv.org/html/2606.13732#A3.I3.i2.p1.4),[§C\.2\.3](https://arxiv.org/html/2606.13732#A3.SS2.SSS3.Px1.p1.1),[§6](https://arxiv.org/html/2606.13732#S6.p4.4)\.
- J\. Huang, S\. Gu, L\. Hou, Y\. Wu, X\. Wang, H\. Yu, and J\. Han \(2023\)Large language models can self\-improve\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Singapore\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.67/)Cited by:[§1](https://arxiv.org/html/2606.13732#S1.p3.1)\.
- P\. Humane, P\. Cudrano, D\. Z\. Kaplan, M\. Matteucci, S\. Chakraborty, and I\. Rish \(2025\)Influence functions for efficient data selection in reasoning\.InNeurIPS 2025 Workshop on Efficient Reasoning,External Links:[Link](https://openreview.net/forum?id=wQyK6NEpoy)Cited by:[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p3.1)\.
- D\. Jarvis, R\. Klein, B\. Rosman, S\. James, and S\. S\. Mannelli \(2026\)Position: the stochastic parrot in the coal mine\. model collapse is a threat to low\-resource communities\.InProceedings of the 43rd International Conference on Machine Learning,Proceedings of Machine Learning Research\.External Links:[Link](https://icml.cc/virtual/2026/poster/67117)Cited by:[§A\.1](https://arxiv.org/html/2606.13732#A1.SS1.p2.1),[§C\.5](https://arxiv.org/html/2606.13732#A3.SS5.SSS0.Px2.p2.1)\.
- H\. A\. Just, F\. Kang, T\. Wang, Y\. Zeng, M\. Ko, M\. Jin, and R\. Jia \(2023\)LAVA: data valuation without pre\-specified learning algorithms\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=JJuP86nBl4q)Cited by:[3rd item](https://arxiv.org/html/2606.13732#S3.I2.i3.p1.6),[§3\.3](https://arxiv.org/html/2606.13732#S3.SS3.p2.3),[§4\.1](https://arxiv.org/html/2606.13732#S4.SS1.p3.2)\.
- J\. Kazdan, R\. Schaeffer, A\. Dey, M\. Gerstgrasser, R\. Rafailov, D\. L\. Donoho, and S\. Koyejo \(2025\)Collapse or thrive: perils and promises of synthetic data in a self\-generating world\.InForty\-second International Conference on Machine Learning,Cited by:[§A\.1](https://arxiv.org/html/2606.13732#A1.SS1.p3.1),[§1](https://arxiv.org/html/2606.13732#S1.p1.1),[2nd item](https://arxiv.org/html/2606.13732#S2.I1.i2.p1.2),[3rd item](https://arxiv.org/html/2606.13732#S2.I1.i3.p1.1),[§2](https://arxiv.org/html/2606.13732#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.13732#S2.SS0.SSS0.Px2.p4.2),[§3\.2](https://arxiv.org/html/2606.13732#S3.SS2.p3.4),[Proposition 2](https://arxiv.org/html/2606.13732#Thmproposition2),[footnote 1](https://arxiv.org/html/2606.13732#footnote1)\.
- K\. Kenthapadi, A\. Korolova, I\. Mironov, and N\. Mishra \(2013\)Privacy via the johnson\-lindenstrauss transform\.Journal of Privacy and Confidentiality5\(1\),pp\. 39–71\.Cited by:[§C\.6](https://arxiv.org/html/2606.13732#A3.SS6.p1.17)\.
- S\. Kessler, T\. Le, and V\. Nguyen \(2025\)SAVA: scalable learning\-agnostic data valuation\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=0UCoWxPhQ4)Cited by:[§4\.1](https://arxiv.org/html/2606.13732#S4.SS1.p1.4)\.
- P\. W\. Koh and P\. Liang \(2017\)Understanding black\-box predictions via influence functions\.InProceedings of the 34th International Conference on Machine Learning,D\. Precup and Y\. W\. Teh \(Eds\.\),Proceedings of Machine Learning Research, Vol\.70,pp\. 1885–1894\.External Links:[Link](https://proceedings.mlr.press/v70/koh17a.html)Cited by:[§4\.1](https://arxiv.org/html/2606.13732#S4.SS1.p1.4)\.
- S\. Kolouri, S\. R\. Park, M\. Thorpe, D\. Slepcev, and G\. K\. Rohde \(2017\)Optimal mass transport: signal processing and machine\-learning applications\.IEEE signal processing magazine34\(4\),pp\. 43–59\.Cited by:[Property 3](https://arxiv.org/html/2606.13732#Thmproperty3)\.
- A\. Krause and D\. Golovin \(2014\)Submodular function maximization\.\.Tractability3\(71\-104\),pp\. 3\.Cited by:[§4\.3](https://arxiv.org/html/2606.13732#S4.SS3.p6.1)\.
- A\. Krizhevsky, G\. Hinton,et al\.\(2009\)Learning multiple layers of features from tiny images\.Cited by:[§6](https://arxiv.org/html/2606.13732#S6.p4.4)\.
- M\. Kusner, Y\. Sun, N\. Kolkin, and K\. Weinberger \(2015\)From word embeddings to document distances\.InInternational conference on machine learning,pp\. 957–966\.Cited by:[§C\.1](https://arxiv.org/html/2606.13732#A3.SS1.SSS0.Px2.p1.1)\.
- T\. Kynkäänniemi, T\. Karras, S\. Laine, J\. Lehtinen, and T\. Aila \(2019\)Improved precision and recall metric for assessing generative models\.Advances in neural information processing systems32\.Cited by:[2nd item](https://arxiv.org/html/2606.13732#A3.I2.i2.p1.1),[§6](https://arxiv.org/html/2606.13732#S6.p3.1)\.
- N\. Lê Tien, A\. Habrard, and M\. Sebban \(2019\)Differentially private optimal transport: application to domain adaptation\.\.InIJCAI,pp\. 2852–2858\.Cited by:[§A\.3](https://arxiv.org/html/2606.13732#A1.SS3.p1.1),[§A\.3](https://arxiv.org/html/2606.13732#A1.SS3.p2.1),[§C\.6](https://arxiv.org/html/2606.13732#A3.SS6.p1.17),[Remark 2](https://arxiv.org/html/2606.13732#Thmremark2.p1.1.1)\.
- W\. Li, S\. Fu, F\. Zhang, and Y\. Pang \(2024\)Data valuation and detections in federated learning\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 12027–12036\.Cited by:[§A\.3](https://arxiv.org/html/2606.13732#A1.SS3.p1.1),[§C\.1](https://arxiv.org/html/2606.13732#A3.SS1.p3.1),[§1](https://arxiv.org/html/2606.13732#S1.p7.1),[§4\.1](https://arxiv.org/html/2606.13732#S4.SS1.p1.4),[§4\.1](https://arxiv.org/html/2606.13732#S4.SS1.p2.6),[§4\.2](https://arxiv.org/html/2606.13732#S4.SS2.SSS0.Px1.p1.3)\.
- W\. Li and Y\. Pang \(2024\)Private wasserstein distance\.arXiv preprint arXiv:2404\.06787\.Cited by:[§C\.1](https://arxiv.org/html/2606.13732#A3.SS1.p3.1)\.
- C\. Lin \(2004\)ROUGE: a package for automatic evaluation of summaries\.InText Summarization Branches Out,pp\. 74–81\.Cited by:[§C\.5](https://arxiv.org/html/2606.13732#A3.SS5.SSS0.Px2.p1.1)\.
- J\. Lin, L\. Tao, M\. Dong, and C\. Xu \(2025\)Diffusion attribution score: evaluating training data influence in diffusion model\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=kuutidLf6R)Cited by:[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p3.1)\.
- S\. Lin, K\. Wang, X\. Zeng, and R\. Zhao \(2023\)Explore the power of synthetic data on few\-shot object detection\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 638–647\.Cited by:[3rd item](https://arxiv.org/html/2606.13732#A3.I1.i3.p1.3),[Table 1](https://arxiv.org/html/2606.13732#S5.T1.13.11.1),[§6](https://arxiv.org/html/2606.13732#S6.p2.3)\.
- X\. Liu, Z\. Xie, and S\. Zhang \(2025\)Extensions of robbins\-siegmund theorem with applications in reinforcement learning\.arXiv preprint arXiv:2509\.26442\.Cited by:[Lemma 3](https://arxiv.org/html/2606.13732#Thmlemma3)\.
- Z\. Liu, P\. Luo, X\. Wang, and X\. Tang \(2015\)Deep learning face attributes in the wild\.InProceedings of International Conference on Computer Vision \(ICCV\),Cited by:[§6](https://arxiv.org/html/2606.13732#S6.p4.4)\.
- P\. Mai, Y\. Ding, Z\. Lyu, M\. Du, and Y\. Pang \(2025\)SecEmb: sparsity\-aware secure federated learning of on\-device recommender system with large embedding\.Proceedings of Machine Learning Research267,pp\. 42642–42667\.Cited by:[§A\.3](https://arxiv.org/html/2606.13732#A1.SS3.p1.1)\.
- P\. Mai, Y\. Yang, R\. Yan, and Y\. Pang \(2026\)Nexus scissor: enhance open\-access language model safety by connection pruning\.npj Artificial Intelligence2\(1\),pp\. 1\.Cited by:[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p2.1)\.
- R\. J\. McCann \(1997\)A convexity principle for interacting gases\.Advances in mathematics128\(1\),pp\. 153–179\.Cited by:[Property 2](https://arxiv.org/html/2606.13732#Thmproperty2)\.
- A\. Mehrotra and N\. K\. Vishnoi \(2023\)Maximizing submodular functions for recommendation in the presence of biases\.InProceedings of the ACM Web Conference 2023,WWW ’23,New York, NY, USA,pp\. 3625–3636\.External Links:ISBN 9781450394161,[Link](https://doi.org/10.1145/3543507.3583195),[Document](https://dx.doi.org/10.1145/3543507.3583195)Cited by:[§4\.3](https://arxiv.org/html/2606.13732#S4.SS3.p6.1)\.
- T\. Mikolov, K\. Chen, G\. Corrado, and J\. Dean \(2013\)Efficient estimation of word representations in vector space\.arXiv preprint arXiv:1301\.3781\.Cited by:[§C\.1](https://arxiv.org/html/2606.13732#A3.SS1.SSS0.Px2.p2.11)\.
- B\. K\. Mlodozeniec, R\. Eschenhagen, J\. Bae, A\. Immer, D\. Krueger, and R\. E\. Turner \(2025\)Influence functions for scalable data attribution in diffusion models\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=esYrEndGsr)Cited by:[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p3.1)\.
- H\. Mobahi, M\. Farajtabar, and P\. Bartlett \(2020\)Self\-distillation amplifies regularization in hilbert space\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 3351–3361\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/2288f691b58edecadcc9a8691762b4fd-Paper.pdf)Cited by:[§A\.1](https://arxiv.org/html/2606.13732#A1.SS1.p3.1)\.
- G\. L\. Nemhauser, L\. A\. Wolsey, and M\. L\. Fisher \(1978\)An analysis of approximations for maximizing submodular set functions—i\.Mathematical programming14\(1\),pp\. 265–294\.Cited by:[§4\.3](https://arxiv.org/html/2606.13732#S4.SS3.p6.5)\.
- G\. Peyré, M\. Cuturi,et al\.\(2019\)Computational optimal transport: with applications to data science\.Foundations and Trends® in Machine Learning11\(5\-6\),pp\. 355–607\.Cited by:[§C\.1](https://arxiv.org/html/2606.13732#A3.SS1.p1.9),[Property 2](https://arxiv.org/html/2606.13732#Thmproperty2.p1.10.4)\.
- O\. Pooladzandi, P\. Khosravi, E\. Nijkamp, and B\. Mirzasoleiman \(2022\)Generating high fidelity synthetic data via coreset selection and entropic regularization\.InNeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research,External Links:[Link](https://openreview.net/forum?id=4mm9w-MeOjr)Cited by:[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p2.1),[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p5.1)\.
- X\. Qiao, N\. Ding, Y\. Cheng, and M\. Zhang \(2026\)Beyond binary erasure: soft\-weighted unlearning for fairness and robustness\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 24936–24944\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/39681)Cited by:[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p3.1)\.
- X\. Qiao, M\. Zhang, M\. Tang, and E\. Wei \(2025\)Hessian\-free online certified unlearning\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=C3TrHWanh5)Cited by:[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p3.1)\.
- A\. Rakotomamonjy, K\. Nadjahi, and L\. Ralaivola \(2024\)Federated wasserstein distance\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=rsg1mvUahT)Cited by:[§A\.3](https://arxiv.org/html/2606.13732#A1.SS3.p1.1),[§A\.3](https://arxiv.org/html/2606.13732#A1.SS3.p2.1),[§B\.7](https://arxiv.org/html/2606.13732#A2.SS7.1.p1.2),[§B\.7](https://arxiv.org/html/2606.13732#A2.SS7.3.p3.6),[§B\.7](https://arxiv.org/html/2606.13732#A2.SS7.5.p5.4),[§C\.1](https://arxiv.org/html/2606.13732#A3.SS1.SSS0.Px1.p1.1),[§C\.1](https://arxiv.org/html/2606.13732#A3.SS1.p3.1),[§4\.3](https://arxiv.org/html/2606.13732#S4.SS3.p5.10),[Property 2](https://arxiv.org/html/2606.13732#Thmproperty2),[Remark 2](https://arxiv.org/html/2606.13732#Thmremark2.p1.1.1),[Theorem 4](https://arxiv.org/html/2606.13732#Thmtheorem4.3.3)\.
- I\. Redko, A\. Habrard, and M\. Sebban \(2017\)Theoretical analysis of domain adaptation with optimal transport\.InMachine Learning and Knowledge Discovery in Databases,M\. Ceci, J\. Hollmén, L\. Todorovski, C\. Vens, and S\. Džeroski \(Eds\.\),Cham,pp\. 737–753\.External Links:ISBN 978\-3\-319\-71246\-8Cited by:[§3\.3](https://arxiv.org/html/2606.13732#S3.SS3.p1.1)\.
- P\. Rezaei, F\. Kovačević, F\. Locatello, and M\. Mondelli \(2026\)High\-dimensional analysis of synthetic data selection\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Y54P2BBPPh)Cited by:[§A\.1](https://arxiv.org/html/2606.13732#A1.SS1.p3.1),[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p3.1),[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p5.1),[1st item](https://arxiv.org/html/2606.13732#A3.I1.i1.p1.1),[§3\.1](https://arxiv.org/html/2606.13732#S3.SS1.p1.2),[Table 1](https://arxiv.org/html/2606.13732#S5.T1.15.13.1),[§6](https://arxiv.org/html/2606.13732#S6.p1.1),[§6](https://arxiv.org/html/2606.13732#S6.p2.3)\.
- H\. Robbins and D\. Siegmund \(1971\)A convergence theorem for non negative almost supermartingales and some applications\.InOptimizing methods in statistics,pp\. 233–257\.Cited by:[Lemma 2](https://arxiv.org/html/2606.13732#Thmlemma2)\.
- O\. Ronneberger, P\. Fischer, and T\. Brox \(2015\)U\-net: convolutional networks for biomedical image segmentation\.InInternational Conference on Medical image computing and computer\-assisted intervention,pp\. 234–241\.Cited by:[2nd item](https://arxiv.org/html/2606.13732#A3.I3.i2.p1.4)\.
- A\. Ruhe \(1970\)Perturbation bounds for means of eigenvalues and invariant subspaces\.BIT Numerical Mathematics10\(3\),pp\. 343–354\.Cited by:[Lemma 4](https://arxiv.org/html/2606.13732#Thmlemma4)\.
- R\. Schaeffer, J\. Kazdan, A\. C\. Arulandu, and S\. Koyejo \(2025\)Position: model collapse does not mean what you think\.arXiv preprint arXiv:2503\.03150\.Cited by:[§A\.1](https://arxiv.org/html/2606.13732#A1.SS1.p1.1),[footnote 1](https://arxiv.org/html/2606.13732#footnote1)\.
- L\. Shi, M\. Wu, H\. Zhang, Z\. Zhang, M\. Tao, and Q\. Qu \(2025\)A closer look at model collapse: from a generalization\-to\-memorization perspective\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=6xCcjYa97j)Cited by:[§A\.1](https://arxiv.org/html/2606.13732#A1.SS1.p2.1),[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p3.1),[§C\.2\.3](https://arxiv.org/html/2606.13732#A3.SS2.SSS3.Px1.p1.1),[§1](https://arxiv.org/html/2606.13732#S1.p2.1),[3rd item](https://arxiv.org/html/2606.13732#S2.I1.i3.p1.1),[§3\.1](https://arxiv.org/html/2606.13732#S3.SS1.p1.2),[§6](https://arxiv.org/html/2606.13732#S6.p4.4)\.
- A\. Shidani, T\. Farghly, Y\. Sun, H\. Ganjgahi, and G\. Deligiannidis \(2025\)Beyond real data: synthetic data through the lens of regularization\.arXiv preprint arXiv:2510\.08095\.Cited by:[§1](https://arxiv.org/html/2606.13732#S1.p1.1),[§6](https://arxiv.org/html/2606.13732#S6.p1.1)\.
- A\. Shoshan, N\. Bhonker, I\. Kviatkovsky, M\. Fintz, and G\. Medioni \(2023\)Synthetic data for model selection\.InInternational Conference on Machine Learning,pp\. 31633–31656\.Cited by:[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p4.1),[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p5.1)\.
- I\. Shumailov, Z\. Shumaylov, Y\. Zhao, Y\. Gal, N\. Papernot, and R\. J\. Anderson \(2023\)The curse of recursion: training on generated data makes models forget\.CoRRabs/2305\.17493\.External Links:[Link](https://doi.org/10.48550/arXiv.2305.17493)Cited by:[1st item](https://arxiv.org/html/2606.13732#S2.I1.i1.p1.1),[§2](https://arxiv.org/html/2606.13732#S2.SS0.SSS0.Px1.p1.4)\.
- I\. Shumailov, Z\. Shumaylov, Y\. Zhao, N\. Papernot, R\. Anderson, and Y\. Gal \(2024\)AI models collapse when trained on recursively generated data\.Nature631\(8022\),pp\. 755–759\.Cited by:[§A\.1](https://arxiv.org/html/2606.13732#A1.SS1.p1.1),[§A\.1](https://arxiv.org/html/2606.13732#A1.SS1.p2.1),[§A\.1](https://arxiv.org/html/2606.13732#A1.SS1.p3.1),[§B\.2](https://arxiv.org/html/2606.13732#A2.SS2.SSS0.Px1.p1.10),[§2](https://arxiv.org/html/2606.13732#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.13732#S2.SS0.SSS0.Px2.p2.6),[§3\.2](https://arxiv.org/html/2606.13732#S3.SS2.p3.4),[Proposition 1](https://arxiv.org/html/2606.13732#Thmproposition1),[footnote 1](https://arxiv.org/html/2606.13732#footnote1)\.
- J\. Song, C\. Meng, and S\. Ermon \(2021\)Denoising diffusion implicit models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=St1giarCHLP)Cited by:[2nd item](https://arxiv.org/html/2606.13732#A3.I3.i2.p1.4)\.
- B\. Sorscher, R\. Geirhos, S\. Shekhar, S\. Ganguli, and A\. S\. Morcos \(2022\)Beyond neural scaling laws: beating power law scaling via data pruning\.InAdvances in Neural Information Processing Systems,A\. H\. Oh, A\. Agarwal, D\. Belgrave, and K\. Cho \(Eds\.\),External Links:[Link](https://openreview.net/forum?id=UmvSlP-PyV)Cited by:[§A\.2](https://arxiv.org/html/2606.13732#A1.SS2.p2.1)\.
- R\. Taori and T\. Hashimoto \(2023\)Data feedback loops: model\-driven amplification of dataset biases\.InInternational Conference on Machine Learning,pp\. 33883–33920\.Cited by:[§1](https://arxiv.org/html/2606.13732#S1.p1.1)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale,et al\.\(2023\)Llama 2: open foundation and fine\-tuned chat models\.arXiv preprint arXiv:2307\.09288\.Cited by:[§C\.5](https://arxiv.org/html/2606.13732#A3.SS5.SSS0.Px2.p1.1)\.
- B\. van Breugel, H\. Sun, Z\. Qian, and M\. van der Schaar \(2023\)Membership inference attacks against synthetic data through overfitting detection\.InProceedings of The 26th International Conference on Artificial Intelligence and Statistics,F\. Ruiz, J\. Dy, and J\. van de Meent \(Eds\.\),Proceedings of Machine Learning Research, Vol\.206,pp\. 3493–3514\.External Links:[Link](https://proceedings.mlr.press/v206/breugel23a.html)Cited by:[§4\.1](https://arxiv.org/html/2606.13732#S4.SS1.p4.4)\.
- C\. Villaniet al\.\(2008\)Optimal transport: old and new\.Vol\.338,Springer\.Cited by:[§A\.3](https://arxiv.org/html/2606.13732#A1.SS3.p1.1),[§C\.1](https://arxiv.org/html/2606.13732#A3.SS1.p2.1)\.
- X\. Wei and X\. Zhang \(2025\)Self\-consuming generative models with adversarially curated data\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=UWWNxyIT1h)Cited by:[§1](https://arxiv.org/html/2606.13732#S1.p3.1),[§1](https://arxiv.org/html/2606.13732#S1.p5.1),[§3\.1](https://arxiv.org/html/2606.13732#S3.SS1.p6.2),[§4\.1](https://arxiv.org/html/2606.13732#S4.SS1.p1.4)\.
- S\. Wyllie, I\. Shumailov, and N\. Papernot \(2024\)Fairness feedback loops: training on synthetic data amplifies bias\.InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency,pp\. 2113–2147\.Cited by:[§1](https://arxiv.org/html/2606.13732#S1.p1.1)\.
- B\. Yi, Q\. Liu, Y\. Cheng, and H\. Xu \(2026\)Escaping model collapse via synthetic data verification: near\-term improvements and long\-term convergence\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=yfk6c39omW)Cited by:[§1](https://arxiv.org/html/2606.13732#S1.p2.1),[§1](https://arxiv.org/html/2606.13732#S1.p3.1)\.
- Y\. Zhou, Z\. Liang, H\. Liu, W\. Yu, K\. Panaganti, L\. Song, D\. Yu, X\. Zhang, H\. Mi, and D\. Yu \(2025\)On the evolution of language models without labels: majority drives selection, novelty promotes variation\.InThe 5th Workshop on Mathematical Reasoning and AI at NeurIPS 2025,External Links:[Link](https://openreview.net/forum?id=eU9fZMrrAL)Cited by:[§1](https://arxiv.org/html/2606.13732#S1.p1.1)\.
Figure 6: How latent\-space distributions evolve over the course of training using generated data\. As seen in the Gaussian instance, the model tends to ignore extreme values \(tails\) and gravitate toward the mean\.

## Appendix A Related Work
### A\.1 Model Collapse
The phenomenon of model collapse is not a monolith but rather a spectrum of pathological behaviors that have been analyzed from different theoretical perspectives\. A significant body of literature characterizes collapse primarily through the lens of prediction error or population risk on the original data distribution\. Within this framework, collapse is often defined as a catastrophic, nonlinear degradation of model performance, rendering the model functionally useless after a limited number of recursive generations \(Schaeffer et al\., [2025](https://arxiv.org/html/2606.13732#bib.bib2)\)\. A more rigorous mathematical formulation extends this to the asymptotic divergence of the test loss, describing a scenario where the model drifts indefinitely away from the ground\-truth manifold \(Shumailov et al\., [2024](https://arxiv.org/html/2606.13732#bib.bib1); Dohmatob et al\., [2024](https://arxiv.org/html/2606.13732#bib.bib75)\)\. While these risk\-based definitions provide a high\-level signal of failure, they tend to obscure the underlying geometric deformations occurring within the learned distribution, leading to recent calls for more granular definitions based on distribution shifts \(Schaeffer et al\., [2025](https://arxiv.org/html/2606.13732#bib.bib2)\)\.
Complementary to risk\-based metrics, a distinct stream of research focuses on the statistical degeneration of the generated data’s topology\. This perspective characterizes collapse as a transition from generalization to memorization, where the model’s output diversity shrinks significantly \(Shi et al\., [2025](https://arxiv.org/html/2606.13732#bib.bib5)\)\. As shown in [Figure 6](https://arxiv.org/html/2606.13732#A0.F6), the core of this phenomenon is Variance Collapse, where recursive training causes the variance of the learned distribution to contract asymptotically toward zero, effectively reducing the distribution to a point estimate \(Alemohammad et al\., [2024](https://arxiv.org/html/2606.13732#bib.bib14)\)\. This process is frequently preceded by the erosion of distributional tails, where low\-probability events \(the “long tail”\) are statistically washed away, leaving only a homogenized core \(Hataya et al\., [2023](https://arxiv.org/html/2606.13732#bib.bib78); Shumailov et al\., [2024](https://arxiv.org/html/2606.13732#bib.bib1)\)\. In multi\-modal settings, this manifests as mode dropping or mode entanglement, simplifying the complex data manifold into a few high\-density regions \(Alemohammad et al\., [2024](https://arxiv.org/html/2606.13732#bib.bib14)\)\. Importantly, such tail erosion is not merely a statistical artifact: it can have uneven social and linguistic consequences\. Jarvis et al\. \([2026](https://arxiv.org/html/2606.13732#bib.bib92)\) argue that low\-resource languages and marginalized communities often occupy precisely these underrepresented regions of the data distribution, and thus may experience model collapse earlier and more severely than high\-resource communities\. This perspective reframes collapse as a distributional inequity problem, where the loss of rare modes corresponds to the erasure of culturally, linguistically, or demographically underrepresented content\.
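As a toy illustration of the variance contraction described above, consider a standard textbook calculation under assumptions that are ours and not a result of this paper: each generation fits a one\-dimensional Gaussian by maximum likelihood to $n$ samples drawn from the previous generation's model, with no real data and no selection step\. The symbols $n$, $\mu_t$, and $\sigma_t^2$ are introduced here only for this sketch\.

$$
x_1,\dots,x_n \sim \mathcal{N}(\mu_t,\sigma_t^2),\qquad
\mu_{t+1} = \frac{1}{n}\sum_{i=1}^{n} x_i,\qquad
\sigma_{t+1}^2 = \frac{1}{n}\sum_{i=1}^{n}\bigl(x_i-\mu_{t+1}\bigr)^2
$$

$$
\mathbb{E}\bigl[\sigma_{t+1}^2 \,\big|\, \sigma_t^2\bigr] = \frac{n-1}{n}\,\sigma_t^2
\quad\Longrightarrow\quad
\mathbb{E}\bigl[\sigma_t^2\bigr] = \Bigl(1-\frac{1}{n}\Bigr)^{t}\sigma_0^2 \;\xrightarrow[t\to\infty]{}\; 0
$$

In this simplified setting the expected variance shrinks geometrically with the generation index $t$\. It is meant only to convey the direction of the effect and should not be read as the decay rate analyzed in the paper\.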
To understand the mechanics of this decay, recent works have analyzed the interplay between loss functions and high-dimensional geometry. Theoretical underpinnings attribute variance collapse to the smoothing nature of standard loss functions (e.g., MSE or cross-entropy), which encourage models to approximate the conditional mean of the data and act as low-pass filters that dampen high-frequency variations (Mobahi et al., [2020](https://arxiv.org/html/2606.13732#bib.bib76)). This effect is exacerbated in high-dimensional regimes. As the dimensionality increases, the volume of the distribution support grows exponentially, making the tails increasingly difficult to cover with synthetic samples. Consequently, the covariance shift between the target distribution and the synthetic approximation becomes the dominant factor in generalization error, accelerating the collapse process (Rezaei et al., [2026](https://arxiv.org/html/2606.13732#bib.bib6)). The severity of these pathological behaviors is intrinsically linked to the training paradigm employed. Theoretical analyses typically distinguish between Replace and Accumulate strategies. Under the Replace paradigm, where training data is entirely substituted by synthetic samples at each generation, variance collapse is theoretically proven to be inevitable regardless of model capacity (Shumailov et al., [2024](https://arxiv.org/html/2606.13732#bib.bib1); Dohmatob et al., [2024](https://arxiv.org/html/2606.13732#bib.bib75); Mobahi et al., [2020](https://arxiv.org/html/2606.13732#bib.bib76)). Conversely, the Accumulate paradigm (where synthetic data is appended to historical real data) has been shown to offer asymptotic stability (Kazdan et al., [2025](https://arxiv.org/html/2606.13732#bib.bib3)).
Empirical studies suggest that maintaining access to the original data (or a high-quality subset) anchors the distribution, preventing the unbounded drift of the mean and variance (Gerstgrasser et al., [2024](https://arxiv.org/html/2606.13732#bib.bib20)).
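The contrast between the two paradigms can be illustrated with a small one-dimensional Gaussian simulation. This is a minimal sketch, not the experimental setup of any cited work: the function names, the unit-variance starting distribution, the sample size, and the use of the unbiased variance estimator are all assumptions made for illustration.

```python
import math
import random


def replace_paradigm(generations, n, rng):
    """Refit on n fresh synthetic samples only, discarding all earlier data."""
    mean, var = 0.0, 1.0  # assumed starting distribution N(0, 1)
    for _ in range(generations):
        samples = [rng.gauss(mean, math.sqrt(var)) for _ in range(n)]
        mean = sum(samples) / n
        var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return var


def accumulate_paradigm(generations, n, rng):
    """Refit on every sample generated so far, using running sums."""
    mean, var = 0.0, 1.0  # assumed starting distribution N(0, 1)
    count, total, total_sq = 0, 0.0, 0.0
    for _ in range(generations):
        for _ in range(n):
            x = rng.gauss(mean, math.sqrt(var))
            count += 1
            total += x
            total_sq += x * x
        mean = total / count
        # Guard against tiny negative values caused by rounding.
        var = max((total_sq - count * mean * mean) / (count - 1), 0.0)
    return var


if __name__ == "__main__":
    print("replace   :", replace_paradigm(2000, 10, random.Random(0)))
    print("accumulate:", accumulate_paradigm(2000, 10, random.Random(0)))
```

Under these assumptions the Replace variance shrinks toward zero over many generations, while the Accumulate variance stays at a nonzero level, which is the qualitative behavior described above.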
### A.2 Data Selection
Recognizing the inevitability of model collapse under naive recursive training, the research community has pivoted toward data selection (Dohmatob et al., [2025b](https://arxiv.org/html/2606.13732#bib.bib83)) as a primary mitigation strategy. Existing approaches can be broadly categorized into three methodological families: fidelity-based filtering, diversity-preserving alignment, and oracle-guided verification.
The intuitive approach involves curating synthetic data based on scalar quality metrics, such as perplexity, log-likelihood, or confidence scores. Classic strategies employ truncation or rejection sampling to discard samples that deviate significantly from the model's high-density regions (Alemohammad et al., [2024](https://arxiv.org/html/2606.13732#bib.bib14)). More recent advances leverage coreset selection techniques, aiming to identify a minimal subset of synthetic data that approximates the gradient or loss landscape of the full distribution (Pooladzandi et al., [2022](https://arxiv.org/html/2606.13732#bib.bib79)). While empirical studies show that pruning data based on hardness or quality metrics can beat power-law scaling (Sorscher et al., [2022](https://arxiv.org/html/2606.13732#bib.bib80)), these quality-centric methods often inadvertently accelerate Variance Collapse. By systematically favoring safe modal samples (Mai et al., [2026](https://arxiv.org/html/2606.13732#bib.bib91)), human preferences reinforce the very homogenization that characterizes the collapse regime (Alemohammad et al., [2024](https://arxiv.org/html/2606.13732#bib.bib14)).
To counteract variance reduction, a second stream of research focuses on maximizing distributional coverage. From an information-theoretic perspective, Shi et al. ([2025](https://arxiv.org/html/2606.13732#bib.bib5)) identify a critical phase transition from generalization to memorization, proposing entropy thresholds to maintain manifold richness. Geometrically, Rezaei et al. ([2026](https://arxiv.org/html/2606.13732#bib.bib6)) prove that for linear models, generalization error is dominated by covariance shift, establishing Covariance Matching as an asymptotically optimal strategy. Moreover, recent works have adopted Influence Functions to quantify the causal effect of individual samples on reasoning capabilities (Mlodozeniec et al., [2025](https://arxiv.org/html/2606.13732#bib.bib86); Lin et al., [2025](https://arxiv.org/html/2606.13732#bib.bib87); Qiao et al., [2026](https://arxiv.org/html/2606.13732#bib.bib93); Humane et al., [2025](https://arxiv.org/html/2606.13732#bib.bib81); Qiao et al., [2025](https://arxiv.org/html/2606.13732#bib.bib94)).
A third, theoretically grounded line of inquiry posits that scaling up with synthetic data is only viable given a reliable verification mechanism. Feng et al. ([2025](https://arxiv.org/html/2606.13732#bib.bib4)) formalized this hypothesis, demonstrating that access to a high-precision verifier (e.g., an external oracle) allows synthetic data training to surpass baselines trained solely on real data. Similarly, Shoshan et al. ([2023](https://arxiv.org/html/2606.13732#bib.bib82)) explore using generative models to synthesize validation sets for model selection, contingent on the ability to calibrate synthetic error against real-domain risk. Under this framework, the efficacy of selection is strictly bounded by the precision and recall of the external supervisor relative to the ground truth.
Research Gap: The Absence of Global Verification. A critical, often unstated assumption unifies the diverse strategies discussed above: they all presume access to a centralized "ground truth" or global statistics. Specifically, coreset selection (Pooladzandi et al., [2022](https://arxiv.org/html/2606.13732#bib.bib79)) and covariance matching (Rezaei et al., [2026](https://arxiv.org/html/2606.13732#bib.bib6)) require access to the full target distribution to compute gradients or statistics, while verification frameworks (Feng et al., [2025](https://arxiv.org/html/2606.13732#bib.bib4); Shoshan et al., [2023](https://arxiv.org/html/2606.13732#bib.bib82)) assume an oracle capable of distinguishing real from synthetic distributions. In real-world decentralized deployments, such as fragmented data silos, this assumption is fundamentally invalid. Local entities possess only partial, biased views of the global distribution. Consequently, applying existing selection strategies locally leads to local selection bias.
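A toy example may make local selection bias concrete. The sketch below is an illustration under stated assumptions, not the selection rule studied in the paper: the two-mode target, the distance-to-reference-mean score, and all names (`select_by_local_reference`, `tail_share`) are hypothetical.

```python
import random


def select_by_local_reference(pool, reference, keep_fraction):
    """Keep the samples closest to the mean of the verifier's local reference."""
    ref_mean = sum(reference) / len(reference)
    ranked = sorted(pool, key=lambda x: abs(x - ref_mean))
    return ranked[: int(keep_fraction * len(pool))]


def tail_share(samples, threshold=5.0):
    """Fraction of samples that fall in the tail mode (values above threshold)."""
    return sum(x > threshold for x in samples) / len(samples)


rng = random.Random(0)
# Assumed global target: 80% head mode near 0, 20% tail mode near 10.
pool = [rng.gauss(0.0, 1.0) for _ in range(800)]
pool += [rng.gauss(10.0, 1.0) for _ in range(200)]
# The silo's verifier has only ever observed the head mode.
local_reference = [rng.gauss(0.0, 1.0) for _ in range(50)]

selected = select_by_local_reference(pool, local_reference, keep_fraction=0.5)
print("tail share before selection:", tail_share(pool))
print("tail share after selection: ", tail_share(selected))
```

Because the reference never covers the tail mode, every tail sample scores poorly and the selected set is drawn almost entirely from the head mode, even though the tail is part of the global target.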
### A.3 Optimal Transport
Optimal Transport (OT) provides a rigorous geometric framework for comparing probability distributions, offering distinct advantages over statistical divergences by capturing the underlying metric space structure (Villani et al., [2008](https://arxiv.org/html/2606.13732#bib.bib34)). While computational advancements such as Sinkhorn iterations (Cuturi, [2013](https://arxiv.org/html/2606.13732#bib.bib84)) have enabled OT applications in high-dimensional machine learning, standard algorithms require centralized access to raw samples, rendering them inapplicable in privacy-sensitive environments (Mai et al., [2025](https://arxiv.org/html/2606.13732#bib.bib90)). To address these constraints, recent scholarship has extended OT to decentralized settings. In the context of metric estimation, Rakotomamonjy et al. ([2024](https://arxiv.org/html/2606.13732#bib.bib29)) proposed the federated Wasserstein Distance, utilizing geodesic interpolants to approximate transport costs across disparate clients without exchanging raw data. The most closely related work to ours is Li et al. ([2024](https://arxiv.org/html/2606.13732#bib.bib31)), which inspires our study by leveraging Wasserstein barycenters for data valuation in federated learning (FL). From an optimization perspective, Dvurechenskii et al. ([2018](https://arxiv.org/html/2606.13732#bib.bib85)) established distributed algorithms for computing barycenters over networks via consensus protocols. Regarding privacy, Lê Tien et al. ([2019](https://arxiv.org/html/2606.13732#bib.bib23)) introduced Differentially Private OT, employing randomized projections to optimize transportation plans under privacy budgets.
A fundamental distinction exists between these prior works and our proposed framework. Existing literature utilizes distributed OT primarily for metric estimation (accurately estimating the global $\mathbb{W}_{2}$ distance (Rakotomamonjy et al., [2024](https://arxiv.org/html/2606.13732#bib.bib29))) or adaptation (learning a mapping between source and target domains (Lê Tien et al., [2019](https://arxiv.org/html/2606.13732#bib.bib23))). In contrast, our work repurposes the OT geometry for gradient-based data selection. By computing the calibrated gradient of the Wasserstein distance with respect to individual synthetic samples, we quantify the marginal contribution of each sample to ground-truth manifold alignment. This transforms distributed OT from a passive metric into an active selection mechanism.
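To give a feel for what a per-sample Wasserstein gradient looks like, the sketch below handles the simplest case: two equal-size one-dimensional samples, where the optimal coupling matches sorted values. It is only an illustration. The calibration step and the actual selection rule of the paper are not reproduced here, and the function names and the choice to keep the samples with the smallest gradient magnitude are assumptions.

```python
def w2_squared_and_gradients(synthetic, reference):
    """Squared 2-Wasserstein distance between two equal-size 1-D samples.

    In one dimension the optimal coupling pairs sorted values, so the
    gradient with respect to a synthetic sample x matched to y is
    2 * (x - y) / n.
    """
    if len(synthetic) != len(reference):
        raise ValueError("this sketch assumes equal sample sizes")
    n = len(synthetic)
    order = sorted(range(n), key=lambda i: synthetic[i])
    matched = sorted(reference)
    grads = [0.0] * n
    cost = 0.0
    for rank, i in enumerate(order):
        diff = synthetic[i] - matched[rank]
        cost += diff * diff / n
        grads[i] = 2.0 * diff / n
    return cost, grads


def select_most_aligned(synthetic, reference, keep):
    """Keep the `keep` samples with the smallest gradient magnitude (assumed rule)."""
    _, grads = w2_squared_and_gradients(synthetic, reference)
    ranked = sorted(range(len(synthetic)), key=lambda i: abs(grads[i]))
    return [synthetic[i] for i in ranked[:keep]]


if __name__ == "__main__":
    cost, grads = w2_squared_and_gradients([5.0, 0.0, 1.0], [0.0, 1.0, 2.0])
    print(cost, grads)
```

Note that the quality of such a score depends entirely on the reference sample: if the reference covers only part of the target, samples from uncovered modes receive large gradients and are discarded.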
## Appendix B Proof
### B.1 Auxiliary Lemmata
Before presenting the main proofs, we introduce several key technical lemmas required for our analysis, including [Lemmas 3](https://arxiv.org/html/2606.13732#Thmlemma3) and [4](https://arxiv.org/html/2606.13732#Thmlemma4), which are used in the proof of [Theorem 1](https://arxiv.org/html/2606.13732#Thmtheorem1).
###### Lemma 1\.
Consider a selection mechanism on a multivariate normal distribution $\mathcal{N}(\bm{\mu},\bm{\Sigma})$ defined by a fixed, locally concave utility function $U(\mathbf{x})$ maximized at $\mathbf{u}^{*}$. Let $\mathcal{R}$ be the high-utility selection region preserving probability mass $\alpha\in(0,1)$. Let $\bm{\Delta}\triangleq\mathbb{E}[\mathbf{x}\mid\mathbf{x}\in\mathcal{R}]-\bm{\mu}$ be the mean drift vector. Assuming $\bm{\mu}$ is in a local neighborhood of $\mathbf{u}^{*}$, there exists a constant $\kappa>0$ such that the drift acts as a restoring force in the standardized (Mahalanobis) metric:

$$(\bm{\mu}-\mathbf{u}^{*})^{\top}\bm{\Sigma}^{-1}\bm{\Delta}\leq-\kappa\|\bm{\mu}-\mathbf{u}^{*}\|_{\bm{\Sigma}^{-1}}^{2}\tag{27}$$

###### Proof.
We analyze the drift by strictly operating in the standardized isotropic frame to avoid the non\-commutativity issues of matrix products in the parameter space\.
1. Transformation to Standardized Space: Let $\mathbf{z}=\bm{\Sigma}^{-1/2}(\mathbf{x}-\bm{\mu})$ be the whitening transformation. The selection region $\mathcal{R}$ corresponds to a region $\mathcal{D}(\mathbf{c})$ in the standardized space, where $\mathbf{c}=\bm{\Sigma}^{-1/2}(\bm{\mu}-\mathbf{u}^{*})$ is the standardized error vector.
The mean drift vector in the parameter space is given by $\bm{\Delta}=\bm{\Sigma}^{1/2}\bm{a}(\mathbf{c})$, where $\bm{a}(\mathbf{c})=\mathbb{E}[\mathbf{z}\mid\mathbf{z}\in\mathcal{D}(\mathbf{c})]$ is the drift in the standardized frame. We aim to show that $\mathbf{c}$ and $\bm{a}(\mathbf{c})$ are opposed.
2. Gradient Analysis (Strict Contraction): The Jacobian of the standardized drift is related to the conditional covariance:

$$\nabla_{\mathbf{c}}\bm{a}(\mathbf{c})=\text{Cov}(\mathbf{z}\mid\mathbf{z}\in\mathcal{D}(\mathbf{c}))-\mathbf{I}_{d}.\tag{28}$$

Since the utility function $U(\mathbf{x})$ is locally concave, the selection induces a log-concave constraint. By applying the Brascamp-Lieb inequality (or properties of log-concave measures), the conditional covariance is strictly contracted relative to the identity matrix. Thus, there exists $\kappa>0$ such that:

$$\text{Cov}(\mathbf{z}\mid\mathbf{z}\in\mathcal{D}(\mathbf{c}))\preceq(1-\kappa)\mathbf{I}_{d}\implies\nabla_{\mathbf{c}}\bm{a}(\mathbf{c})\preceq-\kappa\mathbf{I}_{d}.\tag{29}$$
3. Inner Product Bound: Instead of approximating the vector $\bm{a}(\mathbf{c})$, we apply the Mean Value Theorem in integral form directly to the inner product $\mathbf{c}^{\top}\bm{a}(\mathbf{c})$. Note that $\bm{a}(\mathbf{0})=\mathbf{0}$ due to the local symmetry around the optimum, so that

$$\mathbf{c}^{\top}\bm{a}(\mathbf{c})=\mathbf{c}^{\top}\left(\int_{0}^{1}\nabla\bm{a}(t\mathbf{c})\,dt\right)\mathbf{c}.\tag{30}$$

Since the Jacobian is uniformly bounded above by $-\kappa\mathbf{I}_{d}$, the quadratic form is bounded as:

$$\mathbf{c}^{\top}\bm{a}(\mathbf{c})\leq-\kappa\|\mathbf{c}\|^{2}.\tag{31}$$
4. Mapping back to Parameter Space: We substitute the standardized variables back into the Mahalanobis inner product:

$$\begin{align}
(\bm{\mu}-\mathbf{u}^{*})^{\top}\bm{\Sigma}^{-1}\bm{\Delta}&=(\bm{\Sigma}^{1/2}\mathbf{c})^{\top}\bm{\Sigma}^{-1}(\bm{\Sigma}^{1/2}\bm{a}(\mathbf{c}))\tag{32}\\
&=\mathbf{c}^{\top}\bm{\Sigma}^{1/2}\bm{\Sigma}^{-1}\bm{\Sigma}^{1/2}\bm{a}(\mathbf{c})\tag{33}\\
&=\mathbf{c}^{\top}\bm{a}(\mathbf{c})\tag{34}\\
&\leq-\kappa\|\mathbf{c}\|^{2}\tag{35}\\
&=-\kappa\|\bm{\mu}-\mathbf{u}^{*}\|_{\bm{\Sigma}^{-1}}^{2}.\tag{36}
\end{align}$$

This confirms that the drift acts as a restoring force in the natural geometry induced by the covariance $\bm{\Sigma}$. ∎
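The sign claimed by Lemma 1 can be checked numerically in the simplest setting. The sketch below is a Monte Carlo illustration under assumptions that are not part of the lemma: one dimension, unit variance, the quadratic utility $U(x)=-(x-u^{*})^{2}$, and an empirical top-$\alpha$ cut standing in for the region $\mathcal{R}$. The function name `mean_drift` is hypothetical.

```python
import random


def mean_drift(mu, u_star, alpha, n_samples, rng):
    """Monte Carlo estimate of E[x | x in R] - mu for x ~ N(mu, 1).

    R is approximated by the alpha fraction of samples with the highest
    utility U(x) = -(x - u_star)^2.
    """
    samples = [rng.gauss(mu, 1.0) for _ in range(n_samples)]
    samples.sort(key=lambda x: (x - u_star) ** 2)  # highest utility first
    kept = samples[: int(alpha * n_samples)]
    return sum(kept) / len(kept) - mu


if __name__ == "__main__":
    rng = random.Random(0)
    for mu in (0.5, 1.0, -1.0):
        drift = mean_drift(mu, 0.0, 0.5, 20000, rng)
        # With unit variance, the Mahalanobis inner product is (mu - u_star) * drift.
        print(mu, drift, (mu - 0.0) * drift)
```

In each case the product of the offset and the drift comes out negative, meaning the selected mean moves back toward the utility maximizer, consistent with the restoring-force reading of the lemma.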
###### Lemma 2 (Robbins–Siegmund Theorem, Robbins and Siegmund ([1971](https://arxiv.org/html/2606.13732#bib.bib39))).
Let $(\Omega,\mathcal{F},P)$ be a probability space equipped with a filtration $\{\mathcal{F}_{n}\}_{n\geq 0}$, where $\mathcal{F}_{n}$ represents the information available up to time $n$. Let $\{z_{n}\}_{n\geq 0}$, $\{a_{n}\}_{n\geq 0}$, $\{x_{n}\}_{n\geq 0}$ and $\{y_{n}\}_{n\geq 0}$ be nonnegative $\mathcal{F}_{n}$-measurable random variables. Assume that, almost surely,

$$\mathbb{E}_{n}[z_{n+1}\mid\mathcal{F}_{n}]\leq(1+a_{n})z_{n}+x_{n}-y_{n},\tag{RS}$$

and that

$$\sum_{n=0}^{\infty}a_{n}<\infty\quad\text{a.s.},\qquad\sum_{n=0}^{\infty}x_{n}<\infty\quad\text{a.s.}\tag{37}$$

Then,

1. $\{z_{n}\}$ converges almost surely to a finite random variable $z_{\infty}$.
2. $\sum_{n=0}^{\infty}y_{n}<\infty$ a.s.

###### Lemma 3 (Special Case of Equation [RS](https://arxiv.org/html/2606.13732#A2.Ex1), Liu et al. ([2025](https://arxiv.org/html/2606.13732#bib.bib38))).
Maintain the setting and notation of Lemma [2](https://arxiv.org/html/2606.13732#Thmlemma2). Let $\{T_{n}\}_{n\geq 0}$ be a deterministic sequence with $0<T_{n}<1$ for all $n$. Let $\alpha>0$ and $\eta>0$ be constants and suppose the sequences are identified as $a_{n}=0,\ x_{n}=\eta T_{n}^{2},\ y_{n}=\alpha T_{n}z_{n}$. This yields the following recursion:

$$\mathbb{E}_{n}[z_{n+1}\mid\mathcal{F}_{n}]\leq\left(1-\alpha T_{n}\right)z_{n}+\eta T_{n}^{2}.\tag{38}$$

If this recursion satisfies the Robbins–Monro condition

$$\sum_{n=0}^{\infty}T_{n}=\infty\qquad\text{and}\qquad\sum_{n=0}^{\infty}T_{n}^{2}<\infty,\tag{39}$$

then

$$z_{n}\xrightarrow{a.s.}\mathbf{0}.\tag{40}$$

###### Lemma 4 (Ruhe's Trace Inequality, Ruhe ([1970](https://arxiv.org/html/2606.13732#bib.bib40))).
Let $\mathbf{A}$ and $\mathbf{B}$ be $n\times n$ positive semidefinite Hermitian matrices with eigenvalues denoted by $a_{1}\geq\dots\geq a_{n}\geq 0$ and $b_{1}\geq\dots\geq b_{n}\geq 0$, respectively. Then, the trace of their product is bounded by:

$$\sum_{i=1}^{n}a_{i}b_{n-i+1}\leq\text{Tr}(\mathbf{AB})\leq\sum_{i=1}^{n}a_{i}b_{i}.\tag{41}$$
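As a quick sanity check of Lemma 4, the bounds can be evaluated by hand for a pair of small matrices. The sketch below uses two real symmetric $2\times 2$ matrices chosen purely for illustration; the helper names are hypothetical, and the closed-form eigenvalue formula applies only to the symmetric $2\times 2$ case.

```python
import math


def eigenvalues_sym2(m):
    """Eigenvalues of a symmetric 2x2 matrix [[p, q], [q, r]], largest first."""
    (p, q), (_, r) = m
    center = (p + r) / 2.0
    radius = math.hypot((p - r) / 2.0, q)
    return [center + radius, center - radius]


def trace_of_product(m1, m2):
    """Tr(m1 @ m2) for 2x2 matrices, computed entry by entry."""
    return sum(m1[i][k] * m2[k][i] for i in range(2) for k in range(2))


A = [[2.0, 1.0], [1.0, 2.0]]  # eigenvalues 3 and 1
B = [[3.0, 0.0], [0.0, 1.0]]  # eigenvalues 3 and 1

eig_a = eigenvalues_sym2(A)
eig_b = eigenvalues_sym2(B)
lower = sum(eig_a[i] * eig_b[1 - i] for i in range(2))  # oppositely ordered pairing
upper = sum(eig_a[i] * eig_b[i] for i in range(2))      # similarly ordered pairing
trace = trace_of_product(A, B)
print(lower, trace, upper)
```

Here the lower bound is 6, the trace is 8, and the upper bound is 10, so the trace sits strictly between the two eigenvalue pairings because the matrices do not share eigenvectors.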
### B.2 Proof of [Proposition 1](https://arxiv.org/html/2606.13732#Thmproposition1)
See [Proposition 1](https://arxiv.org/html/2606.13732#Thmproposition1).
###### Proof\.
We analyze the two claims separately: the divergence of the Wasserstein distance and the collapse of the variance\.
##### 1. Wasserstein Divergence ($\mathbb{E}[\mathbb{W}_{2}^{2}]\to\infty$).
First, observe that the squared Wasserstein-2 distance between the approximation $\mathcal{N}(\bar{\bm{\mu}}_{t},\bar{\bm{\Sigma}}_{t})$ and the true distribution $\mathcal{N}_{0}$ is lower-bounded by the squared Euclidean distance between their means:

$$\mathbb{W}_{2}^{2}(\mathcal{N}(\bar{\bm{\mu}}_{t},\bar{\bm{\Sigma}}_{t}),\mathcal{N}_{0})\geq\|\bar{\bm{\mu}}_{t}-\bar{\bm{\mu}}_{0}\|^{2}.\tag{42}$$

Let $\hat{\bm{\mu}}_{0}$ and $\hat{\bm{\Sigma}}_{0}$ be the empirical mean and covariance estimated from the initial samples of $\mathcal{N}_{0}$. By the triangle inequality, together with $(a+b)^{2}\leq 2a^{2}+2b^{2}$:
$$\mathbb{W}_{2}^{2}(\mathcal{N}(\bar{\bm{\mu}}_{t},\bar{\bm{\Sigma}}_{t}),\mathcal{N}_{0})+\mathbb{W}_{2}^{2}(\mathcal{N}_{0},\mathcal{N}(\hat{\bm{\mu}}_{0},\hat{\bm{\Sigma}}_{0}))\geq\frac{1}{2}\mathbb{W}_{2}^{2}(\mathcal{N}(\bar{\bm{\mu}}_{t},\bar{\bm{\Sigma}}_{t}),\mathcal{N}(\hat{\bm{\mu}}_{0},\hat{\bm{\Sigma}}_{0})).\tag{43}$$

Rearranging this, and lower-bounding the right-hand side by the squared distance between the means, yields:

$$\mathbb{W}_{2}^{2}(\mathcal{N}(\bar{\bm{\mu}}_{t},\bar{\bm{\Sigma}}_{t}),\mathcal{N}_{0})\geq\frac{1}{2}\|\bar{\bm{\mu}}_{t}-\hat{\bm{\mu}}_{0}\|^{2}-\mathbb{W}_{2}^{2}(\mathcal{N}_{0},\mathcal{N}(\hat{\bm{\mu}}_{0},\hat{\bm{\Sigma}}_{0})).\tag{44}$$

Under the Replace paradigm, the sampling and fitting process matches the setting of Shumailov et al. ([2024](https://arxiv.org/html/2606.13732#bib.bib1)) with unbiased estimators. Referring to the results of Eq. (21) in the Supplementary Information of Shumailov et al. ([2024](https://arxiv.org/html/2606.13732#bib.bib1)), we have
$$\mathbb{E}[\|\bar{\bm{\mu}}_{t}-\hat{\bm{\mu}}_{0}\|^{2}]\geq\text{Tr}(\bar{\bm{\Sigma}}_{0})\sum_{\tau=1}^{t}\frac{1}{n}.\tag{45}$$

Since the sample size is a constant $n$ at each generation, the sum equals $\frac{t}{n}$. Taking the expectation of the rearranged bound, and noting that the last term does not depend on $t$, we obtain as $t\to\infty$:

$$\mathbb{E}[\mathbb{W}_{2}^{2}(\mathcal{N}(\bar{\bm{\mu}}_{t},\bar{\bm{\Sigma}}_{t}),\mathcal{N}_{0})]\geq\frac{\text{Tr}(\bar{\bm{\Sigma}}_{0})}{2n}t-\mathbb{E}[\mathbb{W}_{2}^{2}(\mathcal{N}_{0},\mathcal{N}(\hat{\bm{\mu}}_{0},\hat{\bm{\Sigma}}_{0}))]\rightarrow\infty.\tag{46}$$
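The passage from the triangle inequality to bounds (43) and (44) can be spelled out step by step. In the derivation below, $P_{t}=\mathcal{N}(\bar{\bm{\mu}}_{t},\bar{\bm{\Sigma}}_{t})$ and $\hat{P}_{0}=\mathcal{N}(\hat{\bm{\mu}}_{0},\hat{\bm{\Sigma}}_{0})$ are shorthand introduced here for brevity; they do not appear in the original text.

```latex
\begin{align*}
\mathbb{W}_{2}(P_{t},\hat{P}_{0})
  &\le \mathbb{W}_{2}(P_{t},\mathcal{N}_{0})+\mathbb{W}_{2}(\mathcal{N}_{0},\hat{P}_{0})
  && \text{(triangle inequality)}\\
\mathbb{W}_{2}^{2}(P_{t},\hat{P}_{0})
  &\le 2\,\mathbb{W}_{2}^{2}(P_{t},\mathcal{N}_{0})+2\,\mathbb{W}_{2}^{2}(\mathcal{N}_{0},\hat{P}_{0})
  && \text{(since } (a+b)^{2}\le 2a^{2}+2b^{2}\text{)}\\
\mathbb{W}_{2}^{2}(P_{t},\mathcal{N}_{0})
  &\ge \tfrac{1}{2}\,\mathbb{W}_{2}^{2}(P_{t},\hat{P}_{0})-\mathbb{W}_{2}^{2}(\mathcal{N}_{0},\hat{P}_{0})
  && \text{(divide by 2 and rearrange)}\\
  &\ge \tfrac{1}{2}\,\|\bar{\bm{\mu}}_{t}-\hat{\bm{\mu}}_{0}\|^{2}-\mathbb{W}_{2}^{2}(\mathcal{N}_{0},\hat{P}_{0})
  && \text{(mean lower bound, as in (42))}
\end{align*}
```

The factor $\frac{1}{2}$ in (43) is therefore the price of squaring the triangle inequality, and it carries through to the linear growth rate $\text{Tr}(\bar{\bm{\Sigma}}_{0})/(2n)$ in (46).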
##### 2. Variance Collapse ($\bar{\bm{\Sigma}}_{t}\xrightarrow{a.s.}\mathbf{0}$).
The proof relies on the martingale properties of the covariance trace. Consider the recursive update step where samples are generated as $\mathbf{X}_{i,t}=\bar{\bm{\Sigma}}_{t-1}^{1/2}\mathbf{z}_{i,t}+\bar{\bm{\mu}}_{t-1}$ with $\mathbf{z}_{i,t}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{d})$.
The trace of the covariance matrix, $\text{Tr}(\bar{\bm{\Sigma}}_{t})$, forms a lower-bounded supermartingale. By Doob's Martingale Convergence Theorem, it must converge to a random variable $\bm{\Sigma}_{\infty}$:

$$\text{Tr}(\bar{\bm{\Sigma}}_{t})\longrightarrow\bm{\Sigma}_{\infty}.\tag{47}$$

Specifically, the trace evolves multiplicatively as $\text{Tr}(\bar{\bm{\Sigma}}_{t})=Q_{t}\text{Tr}(\bar{\bm{\Sigma}}_{t-1})$, where $Q_{t}$ is a generalized $\chi^{2}$ random variable (with expectation $\mathbb{E}[Q_{t}]=1$) representing the sum of $d$ independent $\chi^{2}$ variables weighted by the eigenvalues of $\bar{\bm{\Sigma}}_{t-1}$.
For any generation $t$, at least one weight is significant, ensuring that $P(|Q_{t}-1|>\epsilon)>c>0$ for some constant $c$. Consequently, for the product $\text{Tr}(\bar{\bm{\Sigma}}_{t})=\text{Tr}(\bar{\bm{\Sigma}}_{0})\prod_{j=1}^{t}Q_{j}$ to converge to a finite limit $\bm{\Sigma}_{\infty}$, it must be that $\bm{\Sigma}_{\infty}=0$ almost surely (otherwise the fluctuations of $Q_{t}$ would prevent convergence). Since all matrix norms are equivalent, $\text{Tr}(\bar{\bm{\Sigma}}_{t})\to 0$ implies:

$$\bar{\bm{\Sigma}}_{t}\xrightarrow{a.s.}\mathbf{0}.\tag{48}$$

∎
### B.3 Proof of [Proposition 2](https://arxiv.org/html/2606.13732#Thmproposition2)
See [Proposition 2](https://arxiv.org/html/2606.13732#Thmproposition2).
###### Proof\.
In this section, we provide a rigorous derivation of the convergence of the Accumulate paradigm in the multivariate setting\. We define the recursive process as follows:
Sampling Step: At generation $t$, data are sampled from the estimated distribution of the previous generation:

$$X_{i,t}=\bar{\bm{\mu}}_{t-1}+\bar{\bm{\Sigma}}_{t-1}^{1/2}z_{i,t},\quad\text{where }z_{i,t}\sim\mathcal{N}(0,\mathbf{I}_{d}),\quad i=1,\dots,n\tag{49}$$

Learning Step: The parameters are updated using all accumulated historical data from generations $1$ to $t$:

$$\begin{align}
\bar{\bm{\mu}}_{t}&=\frac{1}{tn}\sum_{\tau=1}^{t}\sum_{i=1}^{n}X_{i,\tau}\tag{50}\\
\bar{\bm{\Sigma}}_{t}&=\frac{1}{tn}\sum_{\tau=1}^{t}\sum_{i=1}^{n}(X_{i,\tau}-\bar{\bm{\mu}}_{t})(X_{i,\tau}-\bar{\bm{\mu}}_{t})^{\top}\tag{51}
\end{align}$$
#### B.3.1 Recursive Relation of the Mean
We first derive the recursive relationship between $\bar{\bm{\mu}}_{t}$ and $\bar{\bm{\mu}}_{t-1}$. Decomposing the summation into the history (generations $1$ to $t-1$) and the current generation $t$:

$$\bar{\bm{\mu}}_{t}=\frac{1}{tn}\left(\sum_{\tau=1}^{t-1}\sum_{i=1}^{n}X_{i,\tau}+\sum_{i=1}^{n}X_{i,t}\right)\tag{52}$$

Observing that the first term sums to $n(t-1)\bar{\bm{\mu}}_{t-1}$, and substituting the sampling equation for $X_{i,t}$:

$$\begin{align}
\bar{\bm{\mu}}_{t}&=\frac{1}{tn}\left[n(t-1)\bar{\bm{\mu}}_{t-1}+\sum_{i=1}^{n}(\bar{\bm{\mu}}_{t-1}+\bar{\bm{\Sigma}}_{t-1}^{1/2}z_{i,t})\right]\tag{53}\\
&=\frac{1}{tn}\left[n(t-1)\bar{\bm{\mu}}_{t-1}+n\bar{\bm{\mu}}_{t-1}+n\bar{\bm{\Sigma}}_{t-1}^{1/2}\bar{z}_{t}\right]\tag{54}\\
&=\frac{1}{tn}\left[nt\bar{\bm{\mu}}_{t-1}+n\bar{\bm{\Sigma}}_{t-1}^{1/2}\bar{z}_{t}\right]\tag{55}
\end{align}$$

where $\bar{z}_{t}=\frac{1}{n}\sum_{i=1}^{n}z_{i,t}\sim\mathcal{N}(0,\frac{1}{n}\mathbf{I}_{d})$. This yields the recurrence:

$$\bar{\bm{\mu}}_{t}=\bar{\bm{\mu}}_{t-1}+\bar{\bm{\Sigma}}_{t-1}^{1/2}\frac{\bar{z}_{t}}{t}\tag{56}$$

Unrolling this recurrence leads to the general form:

$$\bar{\bm{\mu}}_{t}=\bar{\bm{\mu}}_{0}+\sum_{k=1}^{t}\bar{\bm{\Sigma}}_{k-1}^{1/2}\frac{\bar{z}_{k}}{k}\tag{57}$$

Taking the expectation, since $\mathbb{E}[\bar{z}_{k}]=0$, we have $\mathbb{E}[\bar{\bm{\mu}}_{t}]=\bar{\bm{\mu}}_{0}$.
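The recurrence (56) is a running-mean update, and it can be checked against the batch definition (50) numerically. The following is a minimal one-dimensional sketch; the starting parameters, the sample sizes, and the function names are assumptions made for illustration.

```python
import math
import random


def batch_mean(history):
    """Mean over every sample accumulated so far, as in Eq. (50)."""
    flat = [x for generation in history for x in generation]
    return sum(flat) / len(flat)


def check_mean_recurrence(generations=20, n=5, seed=0):
    """Return the largest gap between the batch mean and the recurrence (56)."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # assumed initial parameters
    history, worst_gap = [], 0.0
    for t in range(1, generations + 1):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        history.append([mean + std * z_i for z_i in z])  # sampling step, Eq. (49)
        z_bar = sum(z) / n
        recurrence_mean = mean + std * z_bar / t          # Eq. (56)
        new_mean = batch_mean(history)                     # Eq. (50)
        worst_gap = max(worst_gap, abs(new_mean - recurrence_mean))
        # Refit the standard deviation on all accumulated data, Eq. (51).
        flat = [x for generation in history for x in generation]
        std = math.sqrt(sum((x - new_mean) ** 2 for x in flat) / len(flat))
        mean = new_mean
    return worst_gap


if __name__ == "__main__":
    print(check_mean_recurrence())
```

The two computations agree up to floating-point rounding, which reflects that the $1/t$ weight in (56) is exactly the share of the newest generation in the accumulated pool.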
#### B.3.2 Recursive Relation of the Covariance
We aim to find the expected covariance matrix $\mathbb{E}[\bar{\bm{\Sigma}}_{t}]$. Using the law of total expectation, we first compute $\mathbb{E}[\bar{\bm{\Sigma}}_{t}\mid\mathcal{F}_{t-1}]$. Using the multivariate decomposition identity $\sum_{i=1}^{N}(x_{i}-c)(x_{i}-c)^{\top}=\sum_{i=1}^{N}(x_{i}-\bar{x})(x_{i}-\bar{x})^{\top}+N(\bar{x}-c)(\bar{x}-c)^{\top}$, we decompose the total covariance into Within-Group Covariance (A) and Between-Group Deviation (B):

$$nt\bar{\bm{\Sigma}}_{t}=\sum_{\tau=1}^{t}\left[\underbrace{\sum_{i=1}^{n}(X_{i,\tau}-\bar{X}_{\tau})(X_{i,\tau}-\bar{X}_{\tau})^{\top}}_{\text{Term A}_{\tau}:\ \text{Within}}+\underbrace{\sum_{i=1}^{n}(\bar{X}_{\tau}-\bar{\bm{\mu}}_{t})(\bar{X}_{\tau}-\bar{\bm{\mu}}_{t})^{\top}}_{\text{Term B}_{\tau}:\ \text{Between}}\right]\tag{58}$$

where $\bar{X}_{\tau}$ is the sample mean of generation $\tau$. Substituting $X_{i,\tau}=\bar{\bm{\mu}}_{\tau-1}+\bar{\bm{\Sigma}}_{\tau-1}^{1/2}z_{i,\tau}$, we have $\bar{X}_{\tau}=\bar{\bm{\mu}}_{\tau-1}+\bar{\bm{\Sigma}}_{\tau-1}^{1/2}\bar{z}_{\tau}$. The within-group term simplifies using $S_{\tau}=\sum_{i=1}^{n}(z_{i,\tau}-\bar{z}_{\tau})(z_{i,\tau}-\bar{z}_{\tau})^{\top}$:

$$\text{Term A}_{\tau}=\bar{\bm{\Sigma}}_{\tau-1}^{1/2}S_{\tau}(\bar{\bm{\Sigma}}_{\tau-1}^{1/2})^{\top}\tag{59}$$

We calculate the conditional expectation by splitting the summation into historical terms ($\tau<t$) and the new term ($\tau=t$).
Step 1: Historical Contribution ($\tau<t$). For $\tau<t$, $\bar{\bm{\Sigma}}_{\tau-1}$ is fixed in $\mathcal{F}_{t-1}$.
- •Within Covariance: $\mathbb{E}[\text{Term A}_{\tau}]=(n-1)\bar{\bm{\Sigma}}_{\tau-1}$.
- •Between Deviation: We express the deviation vector:

$$\bar{X}_{\tau}-\bar{\bm{\mu}}_{t}=(\bar{X}_{\tau}-\bar{\bm{\mu}}_{t-1})-(\bar{\bm{\mu}}_{t}-\bar{\bm{\mu}}_{t-1})=(\bar{X}_{\tau}-\bar{\bm{\mu}}_{t-1})-\bar{\bm{\Sigma}}_{t-1}^{1/2}\frac{\bar{z}_{t}}{t}\tag{60}$$

Squaring this (outer product) and taking the expectation, the cross-term vanishes due to the independence of $\bar{z}_{t}$ from historical data. The random part contributes:

$$\mathbb{E}\left[n\left(\bar{\bm{\Sigma}}_{t-1}^{1/2}\frac{\bar{z}_{t}}{t}\right)\left(\bar{\bm{\Sigma}}_{t-1}^{1/2}\frac{\bar{z}_{t}}{t}\right)^{\top}\right]=\frac{n}{t^{2}}\bar{\bm{\Sigma}}_{t-1}^{1/2}\frac{\mathbf{I}_{d}}{n}(\bar{\bm{\Sigma}}_{t-1}^{1/2})^{\top}=\frac{1}{t^{2}}\bar{\bm{\Sigma}}_{t-1}\tag{61}$$

Thus, the total expectation for one historical step $\tau$, which by the decomposition (58) combines Term A$_{\tau}$ and Term B$_{\tau}$, is:

$$\mathbb{E}[\text{Term A}_{\tau}+\text{Term B}_{\tau}]=\mathbb{E}\left[\sum_{i=1}^{n}(X_{i,\tau}-\bar{\bm{\mu}}_{t})(X_{i,\tau}-\bar{\bm{\mu}}_{t})^{\top}\right]=n\bar{\bm{\Sigma}}_{t-1}+\frac{1}{t^{2}}\bar{\bm{\Sigma}}_{t-1}\tag{62}$$

Summing over the $t-1$ historical steps and using the definition of the previous total covariance $\bar{\bm{\Sigma}}_{t-1}$:

$$\sum_{\tau=1}^{t-1}\mathbb{E}[\text{Term A}_{\tau}+\text{Term B}_{\tau}]=\sum_{\tau=1}^{t-1}\mathbb{E}\left[\sum_{i=1}^{n}(X_{i,\tau}-\bar{\bm{\mu}}_{t})(X_{i,\tau}-\bar{\bm{\mu}}_{t})^{\top}\right]=n(t-1)\bar{\bm{\Sigma}}_{t-1}+\frac{t-1}{t^{2}}\bar{\bm{\Sigma}}_{t-1}\tag{63}$$
Step 2: New Data Contribution ($\tau=t$). For the current step $\tau=t$:
- •Within Covariance: $\mathbb{E}[\text{Term A}_{t}]=(n-1)\bar{\bm{\Sigma}}_{t-1}$.
- •Between Deviation: Using [Eq. 56](https://arxiv.org/html/2606.13732#A2.E56), we have $\bar{X}_{t}-\bar{\bm{\mu}}_{t}=\bar{\bm{\Sigma}}_{t-1}^{1/2}\bar{z}_{t}(1-\frac{1}{t})$. Taking the expectation:

$$\mathbb{E}[\text{Term B}_{t}]=n\left(1-\frac{1}{t}\right)^{2}\bar{\bm{\Sigma}}_{t-1}^{1/2}\frac{\mathbf{I}_{d}}{n}(\bar{\bm{\Sigma}}_{t-1}^{1/2})^{\top}=\left(\frac{t-1}{t}\right)^{2}\bar{\bm{\Sigma}}_{t-1}\tag{64}$$
Step 3: Aggregation. Summing all components:

$$\begin{align}
\mathbb{E}[nt\bar{\bm{\Sigma}}_{t}\mid\mathcal{F}_{t-1}]&=\left[\underbrace{n(t-1)}_{\text{History Base}}+\underbrace{(t-1)/t^{2}}_{\text{History Drift}}+\underbrace{(n-1)}_{\text{New Within}}+\underbrace{(t-1)^{2}/t^{2}}_{\text{New Between}}\right]\bar{\bm{\Sigma}}_{t-1}\tag{65}\\
&=\left[n(t-1)+n-1+\frac{t-1+(t-1)^{2}}{t^{2}}\right]\bar{\bm{\Sigma}}_{t-1}\tag{66}\\
&=\left[nt-1+\frac{t-1}{t}\right]\bar{\bm{\Sigma}}_{t-1}=\left[nt-\frac{1}{t}\right]\bar{\bm{\Sigma}}_{t-1}\tag{67}
\end{align}$$

Dividing by $nt$:

$$\mathbb{E}[\bar{\bm{\Sigma}}_{t}\mid\mathcal{F}_{t-1}]=\left(1-\frac{1}{nt^{2}}\right)\bar{\bm{\Sigma}}_{t-1}\tag{68}$$

Since $1-\frac{1}{nt^{2}}<1$, we have $\mathbb{E}[\bar{\bm{\Sigma}}_{t}\mid\mathcal{F}_{t-1}]\preceq\bar{\bm{\Sigma}}_{t-1}$, which implies that $\{\bar{\bm{\Sigma}}_{t}\}$ is a nonnegative supermartingale. By the Martingale Convergence Theorem, $\bar{\bm{\Sigma}}_{t}$ converges almost surely to a limiting random matrix $\bar{\bm{\Sigma}}_{\infty}$.
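The algebra that collapses the four contributions in (65) into the factor in (68) can be verified in exact rational arithmetic. The sketch below only checks the coefficient identity, not the probabilistic argument behind each term; the function names are hypothetical.

```python
from fractions import Fraction


def aggregated_coefficient(n, t):
    """Sum of the four contributions in Eq. (65), in exact arithmetic."""
    history_base = Fraction(n * (t - 1))
    history_drift = Fraction(t - 1, t * t)
    new_within = Fraction(n - 1)
    new_between = Fraction((t - 1) ** 2, t * t)
    return history_base + history_drift + new_within + new_between


def contraction_factor(n, t):
    """Coefficient of the previous covariance in Eq. (68)."""
    return aggregated_coefficient(n, t) / (n * t)


if __name__ == "__main__":
    for n, t in [(2, 1), (10, 3), (100, 50)]:
        print(n, t, aggregated_coefficient(n, t), contraction_factor(n, t))
```

For every pair tried, the aggregate equals $nt-\frac{1}{t}$ and the contraction factor equals $1-\frac{1}{nt^{2}}$, matching (67) and (68).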
Taking the total expectation recursively yields the finite product:

$$\mathbb{E}[\bar{\bm{\Sigma}}_t] = \bar{\bm{\Sigma}}_0 \prod_{k=1}^{t}\left(1 - \frac{1}{nk^2}\right) \tag{69}$$

As $t \to \infty$, using Euler's infinite product formula for the sinc function, $\frac{\sin(\pi x)}{\pi x} = \prod_{k=1}^{\infty}\left(1 - \frac{x^2}{k^2}\right)$, with $x = 1/\sqrt{n}$:

$$\lim_{t\to\infty}\mathbb{E}[\bar{\bm{\Sigma}}_t] = \frac{\sin(\pi/\sqrt{n})}{\pi/\sqrt{n}}\,\bar{\bm{\Sigma}}_0 = \alpha_n \bar{\bm{\Sigma}}_0 \tag{70}$$
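As a quick numerical sanity check of Eqs. (69) and (70), the following minimal sketch compares the finite product against the closed-form limit $\alpha_n$. The function names `partial_product` and `alpha_n` are illustrative and do not come from the paper.

```python
import math


def partial_product(n: int, t: int) -> float:
    """Finite product prod_{k=1}^{t} (1 - 1/(n k^2)) appearing in Eq. (69)."""
    p = 1.0
    for k in range(1, t + 1):
        p *= 1.0 - 1.0 / (n * k * k)
    return p


def alpha_n(n: int) -> float:
    """Closed-form limit sin(pi/sqrt(n)) / (pi/sqrt(n)) from Eq. (70)."""
    x = math.pi / math.sqrt(n)
    return math.sin(x) / x


# Compare the truncated product with the Euler sinc limit for a few sample sizes.
for n in (2, 4, 16):
    print(n, partial_product(n, 100_000), alpha_n(n))
```

For $n = 4$ the limit is exactly $2/\pi$, since $\sin(\pi/2) = 1$; larger $n$ pushes $\alpha_n$ toward 1, meaning less expected covariance shrinkage.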
#### B\.3\.3 Wasserstein Distance
The expected squared 2-Wasserstein distance between $\mathcal{N}(\bar{\bm{\mu}}_t, \bar{\bm{\Sigma}}_t)$ and $\mathcal{N}(\bar{\bm{\mu}}_0, \bar{\bm{\Sigma}}_0)$ is given by:

$$\mathbb{E}[\mathbb{W}_2^2] = \mathbb{E}[\|\bar{\bm{\mu}}_t - \bar{\bm{\mu}}_0\|^2] + \mathbb{E}[\mathfrak{B}^2(\bar{\bm{\Sigma}}_t, \bar{\bm{\Sigma}}_0)] \tag{71}$$

Mean Shift Term: Recall $\bar{\bm{\mu}}_t - \bar{\bm{\mu}}_0 = \sum_{k=1}^{t}\bar{\bm{\Sigma}}_{k-1}^{1/2}\frac{\bar{z}_k}{k}$. The terms $\bar{z}_k$ are independent and zero-mean with covariance $\mathbf{I}_d/n$, so the cross terms vanish in expectation. Thus:

$$\mathbb{E}[\|\bar{\bm{\mu}}_t - \bar{\bm{\mu}}_0\|^2] = \sum_{k=1}^{t}\frac{1}{k^2}\,\text{Tr}\left(\frac{1}{n}\,\mathbb{E}\left[\bar{\bm{\Sigma}}_{k-1}^{1/2}\,\mathbf{I}_d\,(\bar{\bm{\Sigma}}_{k-1}^{1/2})^\top\right]\right) = \sum_{k=1}^{t}\frac{1}{nk^2}\,\text{Tr}(\mathbb{E}[\bar{\bm{\Sigma}}_{k-1}]) \tag{72}$$

Using the product form $\mathbb{E}[\bar{\bm{\Sigma}}_{k-1}] = P_{k-1}\bar{\bm{\Sigma}}_0$, where $P_k = \prod_{\tau=1}^{k}\left(1 - \frac{1}{n\tau^2}\right)$ and $P_0 = 1$, we construct a telescoping sum. Notice that $P_k = P_{k-1}\left(1 - \frac{1}{nk^2}\right) \implies P_{k-1} - P_k = \frac{1}{nk^2}P_{k-1}$. Substituting this into the summation:

\begin{align}
\mathbb{E}[\|\bar{\bm{\mu}}_t - \bar{\bm{\mu}}_0\|^2] &= \text{Tr}(\bar{\bm{\Sigma}}_0)\sum_{k=1}^{t}(P_{k-1} - P_k) \tag{73}\\
&= \text{Tr}(\bar{\bm{\Sigma}}_0)(P_0 - P_t) \xrightarrow{t\to\infty} \text{Tr}(\bar{\bm{\Sigma}}_0)(1 - \alpha_n) \tag{74}
\end{align}

The limit holds because $P_0 = 1$ and $P_t \to \alpha_n$ by Eq. (70).
Covariance Distance Term: As $t \to \infty$, $\bar{\bm{\Sigma}}_t \to \alpha_n\bar{\bm{\Sigma}}_0$ almost surely. The Bures metric term becomes:

\begin{align}
\lim_{t\to\infty}\mathfrak{B}^2(\bar{\bm{\Sigma}}_t, \bar{\bm{\Sigma}}_0) &= \text{Tr}\left(\alpha_n\bar{\bm{\Sigma}}_0 + \bar{\bm{\Sigma}}_0 - 2\left(\alpha_n\bar{\bm{\Sigma}}_0^{1/2}\bar{\bm{\Sigma}}_0\bar{\bm{\Sigma}}_0^{1/2}\right)^{1/2}\right) \tag{75}\\
&= \text{Tr}\left((1 + \alpha_n)\bar{\bm{\Sigma}}_0 - 2\sqrt{\alpha_n}\,\bar{\bm{\Sigma}}_0\right) \tag{76}\\
&= (1 + \alpha_n - 2\sqrt{\alpha_n})\,\text{Tr}(\bar{\bm{\Sigma}}_0) = (1 - \sqrt{\alpha_n})^2\,\text{Tr}(\bar{\bm{\Sigma}}_0) \tag{77}
\end{align}

Combining both terms:

$$\lim_{t\to\infty}\mathbb{E}[\mathbb{W}_2^2] = (1 - \alpha_n)\,\text{Tr}(\bar{\bm{\Sigma}}_0) + (1 - \sqrt{\alpha_n})^2\,\text{Tr}(\bar{\bm{\Sigma}}_0) = 2(1 - \sqrt{\alpha_n})\,\text{Tr}(\bar{\bm{\Sigma}}_0) \tag{78}$$

∎
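To make the decomposition in Eq. (78) concrete, the sketch below evaluates the mean-shift term of Eq. (74) and the Bures term of Eq. (77) separately and checks that they add up to $2(1 - \sqrt{\alpha_n})\,\text{Tr}(\bar{\bm{\Sigma}}_0)$. The helper name `w2_limit_terms` and the choice $\text{Tr}(\bar{\bm{\Sigma}}_0) = 1$ are assumptions made only for illustration.

```python
import math


def w2_limit_terms(n: int, trace_sigma0: float):
    """Return (mean term, covariance term, total) of the limit in Eq. (78)."""
    x = math.pi / math.sqrt(n)
    alpha = math.sin(x) / x                                   # alpha_n, Eq. (70)
    mean_term = (1.0 - alpha) * trace_sigma0                  # Eq. (74)
    cov_term = (1.0 - math.sqrt(alpha)) ** 2 * trace_sigma0   # Eq. (77)
    return mean_term, cov_term, mean_term + cov_term


# Illustrative setting: n = 4 samples per generation, unit initial trace.
mean_term, cov_term, total = w2_limit_terms(n=4, trace_sigma0=1.0)
print(mean_term, cov_term, total)
```

With $n = 4$ we have $\alpha_4 = 2/\pi$, so the total is $2(1 - \sqrt{2/\pi}) \approx 0.404$ times the initial trace, and most of it comes from the mean-shift term.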
### B\.4 Proof of [Theorem 1](https://arxiv.org/html/2606.13732#Thmtheorem1)
See [Theorem 1](https://arxiv.org/html/2606.13732#Thmtheorem1)
###### Proof\.
In this section, we provide the complete derivation for the Accumulate paradigm under biased selection (top-$\alpha$ selection based on a local utility $U(x)$). We analyze the evolution of the mean vector and the covariance matrix to prove the convergence of the mean and the collapse of the variance.

Step 1: Sampling and Selection. At generation $t$, we first generate raw samples from the previous estimate:

$$\tilde{X}_{i,t} \sim \mathcal{N}(\bar{\bm{\mu}}_{t-1}, \bar{\bm{\Sigma}}_{t-1}), \quad i = 1, \dots, n \tag{79}$$

The selection mechanism retains samples within a high-utility region $\mathcal{R}_t \subset \mathbb{R}^d$ enclosing the target $u^*$, such that the acceptance probability is $\alpha$. The selected samples $X_{i,t}$ follow a truncated multivariate normal distribution:

$$X_{i,t} \sim \mathcal{TN}(\bar{\bm{\mu}}_{t-1}, \bar{\bm{\Sigma}}_{t-1}, \mathcal{R}_t) \tag{80}$$

To facilitate analysis, we utilize the standardized reparameterization. Let $\mathcal{D}_{t-1}$ be the standardized selection region:

$$\mathcal{D}_{t-1} = \{\mathbf{z} \in \mathbb{R}^d \mid \bar{\bm{\mu}}_{t-1} + \bar{\bm{\Sigma}}_{t-1}^{1/2}\mathbf{z} \in \mathcal{R}_t\} \tag{81}$$

The selected samples can be expressed as:

$$X_{i,t} = \bar{\bm{\mu}}_{t-1} + \bar{\bm{\Sigma}}_{t-1}^{1/2}\bm{\eta}_{i,t} \tag{82}$$

where $\bm{\eta}_{i,t} \sim \mathcal{TN}(\mathbf{0}, \mathbf{I}_d, \mathcal{D}_{t-1})$ is a standard truncated multivariate normal vector.
Step 2: Key Moments. We characterize the statistics of the standardized truncated noise $\bm{\eta}_{i,t}$:

1. Mean Drift Vector: $\bm{a}_{t-1} \triangleq \mathbb{E}[\bm{\eta}_{i,t} \mid \mathcal{F}_{t-1}]$. This vector represents the directional force toward the high-utility region.
2. Covariance Contraction Matrix: $\mathbf{B}_{t-1} \triangleq \text{Cov}(\bm{\eta}_{i,t} \mid \mathcal{F}_{t-1})$. Due to truncation to a finite probability mass $\alpha < 1$, the variance strictly contracts, implying $\mathbf{0} \prec \mathbf{B}_{t-1} \prec \mathbf{I}_d$.
3. Second Moment: $\mathbb{E}[\bm{\eta}_{i,t}\bm{\eta}_{i,t}^\top \mid \mathcal{F}_{t-1}] = \mathbf{B}_{t-1} + \bm{a}_{t-1}\bm{a}_{t-1}^\top$.

We define the dissipation matrix $\mathbf{\Psi}_{t-1}$ as the loss in the second moment compared to the identity matrix (isotropic noise):

$$\mathbf{\Psi}_{t-1} \triangleq \mathbf{I}_d - (\mathbf{B}_{t-1} + \bm{a}_{t-1}\bm{a}_{t-1}^\top) \tag{83}$$

Note that $\mathbf{\Psi}_{t-1}$ is symmetric by construction, as it is the difference between the identity matrix and the symmetric second moment matrix.
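The three moments above have closed forms in one dimension, which gives a simple way to get a feel for $\bm{a}$, $\mathbf{B}$, and $\mathbf{\Psi}$. The sketch below is a minimal illustration that assumes $d = 1$ and an interval selection region $[\text{lo}, \text{hi}]$ in standardized coordinates; the interval shape and the function names are assumptions, not part of the paper's setup.

```python
import math


def phi(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_moments(lo: float, hi: float):
    """Moments of N(0, 1) truncated to [lo, hi] (scalar analogue of Step 2).

    Returns (alpha, a, b, psi): acceptance mass, mean drift, variance
    contraction, and dissipation psi = 1 - (b + a^2) as in Eq. (83).
    """
    alpha = Phi(hi) - Phi(lo)
    a = (phi(lo) - phi(hi)) / alpha
    second_moment = 1.0 + (lo * phi(lo) - hi * phi(hi)) / alpha
    b = second_moment - a * a
    psi = 1.0 - second_moment
    return alpha, a, b, psi


# Symmetric region around the current mean, and a region shifted to the right.
print(truncated_moments(-1.0, 1.0))
print(truncated_moments(0.5, 2.0))
```

For the symmetric interval the drift is zero and all of the dissipation comes from variance contraction; for the shifted interval the drift is positive, pointing toward the selected region.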
#### B\.4\.1 Recursive Relation of the Mean
The parameter update for the mean in the Accumulate paradigm is:

$$\bar{\bm{\mu}}_t = \frac{1}{tn}\left(\sum_{\tau=1}^{t-1}\sum_{i=1}^{n}X_{i,\tau} + \sum_{i=1}^{n}X_{i,t}\right) \tag{84}$$

Recognizing that the first term is $n(t-1)\bar{\bm{\mu}}_{t-1}$ and substituting $X_{i,t} = \bar{\bm{\mu}}_{t-1} + \bar{\bm{\Sigma}}_{t-1}^{1/2}\bm{\eta}_{i,t}$:

\begin{align}
\bar{\bm{\mu}}_t &= \frac{1}{tn}\left[n(t-1)\bar{\bm{\mu}}_{t-1} + \sum_{i=1}^{n}\left(\bar{\bm{\mu}}_{t-1} + \bar{\bm{\Sigma}}_{t-1}^{1/2}\bm{\eta}_{i,t}\right)\right] \tag{85}\\
&= \bar{\bm{\mu}}_{t-1} + \bar{\bm{\Sigma}}_{t-1}^{1/2}\frac{\bar{\bm{\eta}}_t}{t} \tag{86}
\end{align}

where $\bar{\bm{\eta}}_t = \frac{1}{n}\sum_{i=1}^{n}\bm{\eta}_{i,t}$. Taking the conditional expectation:

$$\mathbb{E}[\bar{\bm{\mu}}_t \mid \mathcal{F}_{t-1}] = \bar{\bm{\mu}}_{t-1} + \bar{\bm{\Sigma}}_{t-1}^{1/2}\frac{\bm{a}_{t-1}}{t} \tag{87}$$
Lyapunov Analysis: To rigorously handle the anisotropy of the covariance, we define the Lyapunov function using the Mahalanobis norm induced by the current covariance state. Let $V_t = \|\bar{\bm{\mu}}_t - \mathbf{u}^*\|^2_{\bar{\bm{\Sigma}}_{t-1}^{-1}}$.

Recall the recursive update: $\bar{\bm{\mu}}_t - \mathbf{u}^* = (\bar{\bm{\mu}}_{t-1} - \mathbf{u}^*) + \frac{1}{t}\bar{\bm{\Sigma}}_{t-1}^{1/2}\bar{\bm{\eta}}_t$. Expanding the quadratic form with respect to the weighting matrix $\mathbf{W}_{t-1} \triangleq \bar{\bm{\Sigma}}_{t-1}^{-1}$:

\begin{align}
V_t &= (\bar{\bm{\mu}}_t - \mathbf{u}^*)^\top\mathbf{W}_{t-1}(\bar{\bm{\mu}}_t - \mathbf{u}^*) \tag{88}\\
&= (\bar{\bm{\mu}}_{t-1} - \mathbf{u}^*)^\top\mathbf{W}_{t-1}(\bar{\bm{\mu}}_{t-1} - \mathbf{u}^*) \notag\\
&\quad + \frac{2}{t}(\bar{\bm{\mu}}_{t-1} - \mathbf{u}^*)^\top\mathbf{W}_{t-1}\bar{\bm{\Sigma}}_{t-1}^{1/2}\bar{\bm{\eta}}_t \notag\\
&\quad + \frac{1}{t^2}(\bar{\bm{\Sigma}}_{t-1}^{1/2}\bar{\bm{\eta}}_t)^\top\mathbf{W}_{t-1}(\bar{\bm{\Sigma}}_{t-1}^{1/2}\bar{\bm{\eta}}_t) \notag
\end{align}

Simplifying the terms using $\mathbf{W}_{t-1}\bar{\bm{\Sigma}}_{t-1}^{1/2} = \bar{\bm{\Sigma}}_{t-1}^{-1}\bar{\bm{\Sigma}}_{t-1}^{1/2} = \bar{\bm{\Sigma}}_{t-1}^{-1/2}$:

$$V_t = V_{t-1} + \frac{2}{t}(\bar{\bm{\mu}}_{t-1} - \mathbf{u}^*)^\top\bar{\bm{\Sigma}}_{t-1}^{-1/2}\bar{\bm{\eta}}_t + \frac{1}{t^2}\|\bar{\bm{\eta}}_t\|^2 \tag{89}$$

Taking the expectation conditioned on $\mathcal{F}_{t-1}$:

$$\mathbb{E}[V_t \mid \mathcal{F}_{t-1}] = V_{t-1} + \underbrace{\frac{2}{t}(\bar{\bm{\mu}}_{t-1} - \mathbf{u}^*)^\top\bar{\bm{\Sigma}}_{t-1}^{-1/2}\bm{a}_{t-1}}_{\text{Cross Term}} + \frac{1}{t^2}\mathbb{E}[\|\bar{\bm{\eta}}_t\|^2] \tag{90}$$

Now, we apply [Lemma 1](https://arxiv.org/html/2606.13732#Thmlemma1). Let $\mathbf{c} = \bar{\bm{\Sigma}}_{t-1}^{-1/2}(\bar{\bm{\mu}}_{t-1} - \mathbf{u}^*)$ be the standardized error. The cross term becomes:

$$\frac{2}{t}\mathbf{c}^\top\bm{a}(\mathbf{c}). \tag{91}$$

By [Lemma 1](https://arxiv.org/html/2606.13732#Thmlemma1), the drift satisfies $\mathbf{c}^\top\bm{a}(\mathbf{c}) \leq -\kappa\|\mathbf{c}\|^2 = -\kappa V_{t-1}$. Substituting this back yields:

$$\mathbb{E}[V_t \mid \mathcal{F}_{t-1}] \leq V_{t-1}\left(1 - \frac{2\kappa}{t}\right) + \frac{K}{t^2} \tag{92}$$

where $K = \text{Tr}(\mathbf{B}_{t-1} + \bm{a}\bm{a}^\top)$ bounds the noise term. Since the step sizes satisfy the Robbins-Monro conditions, applying [Lemma 3](https://arxiv.org/html/2606.13732#Thmlemma3) yields:

$$\lim_{t\to\infty}\|\bar{\bm{\mu}}_t - \mathbf{u}^*\|^2_{\bar{\bm{\Sigma}}_{t-1}^{-1}} = 0 \quad (\text{a.s.}) \tag{93}$$

Since the covariance matrix $\bar{\bm{\Sigma}}_t$ remains positive definite and bounded (its eigenvalues do not diverge), convergence in the Mahalanobis metric implies convergence in the Euclidean metric. Thus, $\bar{\bm{\mu}}_t \xrightarrow{a.s.} \mathbf{u}^*$.
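The behavior of the bound in Eq. (92) can be illustrated by iterating its equality case as a deterministic recursion. This is only a sketch of the step-size argument: it treats $\kappa$ and $K$ as fixed constants (in the proof $K$ depends on $t$), ignores the randomness handled by Lemma 3, and uses illustrative values for `v0`, `kappa`, and `K` that are not taken from the paper.

```python
def iterate_bound(v0: float, kappa: float, K: float, steps: int):
    """Iterate v_t = (1 - 2*kappa/t) * v_{t-1} + K / t**2 for t = 1..steps.

    This is the equality case of Eq. (92) with constant kappa and K.
    """
    v = v0
    trajectory = []
    for t in range(1, steps + 1):
        v = (1.0 - 2.0 * kappa / t) * v + K / (t * t)
        trajectory.append(v)
    return trajectory


# Illustrative constants; the contraction 2*kappa/t sums to infinity while
# the noise K/t^2 is summable, which is the Robbins-Monro pattern.
traj = iterate_bound(v0=4.0, kappa=0.4, K=1.0, steps=10_000)
print(traj[0], traj[99], traj[-1])
```

The recursion decays slowly but steadily: the harmonic-type contraction eventually dominates the summable noise term, which is what drives the Lyapunov function to zero.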
#### B\.4\.2 Recursive Relation of the Covariance
We now derive the recursion for the covariance matrix $\bar{\bm{\Sigma}}_t$. Using the multivariate decomposition:

$$nt\bar{\bm{\Sigma}}_t = \sum_{\tau=1}^{t}\left[\underbrace{\sum_{i=1}^{n}(X_{i,\tau} - \bar{X}_\tau)(X_{i,\tau} - \bar{X}_\tau)^\top}_{\text{Term A: Within}} + \underbrace{\sum_{i=1}^{n}(\bar{X}_\tau - \bar{\bm{\mu}}_t)(\bar{X}_\tau - \bar{\bm{\mu}}_t)^\top}_{\text{Term B: Between}}\right] \tag{94}$$

where $\bar{X}_\tau = \bar{\bm{\mu}}_{\tau-1} + \bar{\bm{\Sigma}}_{\tau-1}^{1/2}\bar{\bm{\eta}}_\tau$. The drift vector in the mean update is denoted as $\bm{\Delta}_t = \bar{\bm{\Sigma}}_{t-1}^{1/2}\frac{\bar{\bm{\eta}}_t}{t}$, so $\bar{\bm{\mu}}_t = \bar{\bm{\mu}}_{t-1} + \bm{\Delta}_t$.
Step 1: Historical Contribution ($\tau < t$). For historical data, $\bar{\bm{\Sigma}}_{\tau-1}$ and $\mathbf{B}_{\tau-1}$ are fixed.

- Within Covariance: The expectation of the sample covariance of truncated variables is scaled by the contraction matrix:

  $$\mathbb{E}[\text{Term A}_\tau] = (n-1)\bar{\bm{\Sigma}}_{\tau-1}^{1/2}\mathbf{B}_{\tau-1}(\bar{\bm{\Sigma}}_{\tau-1}^{1/2})^\top \tag{95}$$

  However, under the logic of the Accumulate paradigm for the total sum, the historical "Within" terms plus the historical "Between" deviations (relative to the previous mean $\bar{\bm{\mu}}_{t-1}$) sum exactly to the previous total covariance $n(t-1)\bar{\bm{\Sigma}}_{t-1}$. We must only correct for the shift in the global mean from $\bar{\bm{\mu}}_{t-1}$ to $\bar{\bm{\mu}}_t$.
- Between Deviation Correction: The deviation vector is decomposed as $\bar{X}_\tau - \bar{\bm{\mu}}_t = (\bar{X}_\tau - \bar{\bm{\mu}}_{t-1}) - \bm{\Delta}_t$. Expanding the outer product sum over the historical generations $\tau = 1, \dots, t-1$:

  \begin{align}
  \sum_{\tau=1}^{t-1}n(\bar{X}_\tau - \bar{\bm{\mu}}_t)(\bar{X}_\tau - \bar{\bm{\mu}}_t)^\top &= \sum_{\tau=1}^{t-1}n(\bar{X}_\tau - \bar{\bm{\mu}}_{t-1})(\bar{X}_\tau - \bar{\bm{\mu}}_{t-1})^\top \notag\\
  &\quad - \sum_{\tau=1}^{t-1}n(\bar{X}_\tau - \bar{\bm{\mu}}_{t-1})\bm{\Delta}_t^\top - \sum_{\tau=1}^{t-1}n\bm{\Delta}_t(\bar{X}_\tau - \bar{\bm{\mu}}_{t-1})^\top \notag\\
  &\quad + \sum_{\tau=1}^{t-1}n\bm{\Delta}_t\bm{\Delta}_t^\top \tag{96}
  \end{align}

  The two cross terms vanish because $\sum_{\tau=1}^{t-1}(\bar{X}_\tau - \bar{\bm{\mu}}_{t-1}) = \mathbf{0}$ by definition of the previous mean. The first part combines with "Within" to form $n(t-1)\bar{\bm{\Sigma}}_{t-1}$, and the last term sums to $n(t-1)\bm{\Delta}_t\bm{\Delta}_t^\top$, which is:

  $$n(t-1)\bm{\Delta}_t\bm{\Delta}_t^\top = n(t-1)\frac{1}{t^2}\bar{\bm{\Sigma}}_{t-1}^{1/2}\bar{\bm{\eta}}_t\bar{\bm{\eta}}_t^\top(\bar{\bm{\Sigma}}_{t-1}^{1/2})^\top \tag{97}$$

  Taking the expectation of $\bar{\bm{\eta}}_t\bar{\bm{\eta}}_t^\top$:

  $$\mathbb{E}[\bar{\bm{\eta}}_t\bar{\bm{\eta}}_t^\top \mid \mathcal{F}_{t-1}] = \text{Cov}(\bar{\bm{\eta}}_t) + \mathbb{E}[\bar{\bm{\eta}}_t]\mathbb{E}[\bar{\bm{\eta}}_t]^\top = \frac{\mathbf{B}_{t-1}}{n} + \bm{a}_{t-1}\bm{a}_{t-1}^\top \tag{98}$$

  Thus, the correction from history due to drift is:

  $$\frac{n(t-1)}{t^2}\bar{\bm{\Sigma}}_{t-1}^{1/2}\left(\frac{\mathbf{B}_{t-1}}{n} + \bm{a}_{t-1}\bm{a}_{t-1}^\top\right)(\bar{\bm{\Sigma}}_{t-1}^{1/2})^\top \tag{99}$$
Step 2: New Data Contribution ($\tau = t$).

- Within Covariance: $\mathbb{E}[\text{Term A}_t] = (n-1)\bar{\bm{\Sigma}}_{t-1}^{1/2}\mathbf{B}_{t-1}(\bar{\bm{\Sigma}}_{t-1}^{1/2})^\top$
- Between Deviation: For $\tau = t$, $\bar{X}_t - \bar{\bm{\mu}}_t = \bar{\bm{\Sigma}}_{t-1}^{1/2}\bar{\bm{\eta}}_t\left(1 - \frac{1}{t}\right)$.

  $$\mathbb{E}[\text{Term B}_t] = n\left(\frac{t-1}{t}\right)^2\bar{\bm{\Sigma}}_{t-1}^{1/2}\left(\frac{\mathbf{B}_{t-1}}{n} + \bm{a}_{t-1}\bm{a}_{t-1}^\top\right)(\bar{\bm{\Sigma}}_{t-1}^{1/2})^\top \tag{100}$$
Step 3: Aggregation and Matrix Recurrence. Summing all terms and factoring out $\bar{\bm{\Sigma}}_{t-1}^{1/2}$:

$$\mathbb{E}[nt\bar{\bm{\Sigma}}_t \mid \mathcal{F}_{t-1}] = \bar{\bm{\Sigma}}_{t-1}^{1/2}\Bigg\{n(t-1)\mathbf{I}_d + (n-1)\mathbf{B}_{t-1} + \left[\frac{n(t-1)}{t^2} + \frac{n(t-1)^2}{t^2}\right]\left(\frac{\mathbf{B}_{t-1}}{n} + \bm{a}_{t-1}\bm{a}_{t-1}^\top\right)\Bigg\}(\bar{\bm{\Sigma}}_{t-1}^{1/2})^\top \tag{101}$$

Simplifying the coefficient in the bracket: $\frac{n(t-1) + n(t-1)^2}{t^2} = \frac{n(t-1)t}{t^2} = n\frac{t-1}{t}$. Expanding the terms:

$$\mathbb{E}[nt\bar{\bm{\Sigma}}_t \mid \mathcal{F}_{t-1}] = \bar{\bm{\Sigma}}_{t-1}^{1/2}\left\{n(t-1)\mathbf{I}_d + (n-1)\mathbf{B}_{t-1} + \frac{t-1}{t}\mathbf{B}_{t-1} + n\frac{t-1}{t}\bm{a}_{t-1}\bm{a}_{t-1}^\top\right\}(\bar{\bm{\Sigma}}_{t-1}^{1/2})^\top \tag{102}$$

Dividing by $nt$ to find the recurrence for $\bar{\bm{\Sigma}}_t$:

\begin{align}
\mathbb{E}[\bar{\bm{\Sigma}}_t \mid \mathcal{F}_{t-1}] &= \bar{\bm{\Sigma}}_{t-1}^{1/2}\Bigg\{\frac{t-1}{t}\mathbf{I}_d + \frac{n-1}{nt}\mathbf{B}_{t-1} + \frac{t-1}{nt^2}\mathbf{B}_{t-1} + \frac{t-1}{t^2}\bm{a}_{t-1}\bm{a}_{t-1}^\top\Bigg\}(\bar{\bm{\Sigma}}_{t-1}^{1/2})^\top \tag{103}\\
&= \bar{\bm{\Sigma}}_{t-1}^{1/2}\Bigg\{\left(1 - \frac{1}{t}\right)\mathbf{I}_d + \left(\frac{1}{t} - \frac{1}{tn} + \frac{1}{tn} - \frac{1}{nt^2}\right)\mathbf{B}_{t-1} + \left(\frac{1}{t} - \frac{1}{t^2}\right)\bm{a}_{t-1}\bm{a}_{t-1}^\top\Bigg\}(\bar{\bm{\Sigma}}_{t-1}^{1/2})^\top \tag{104}
\end{align}

Grouping terms by powers of $t$:
- Coefficient of $1/t$: $-\mathbf{I}_d + \mathbf{B}_{t-1} + \bm{a}_{t-1}\bm{a}_{t-1}^\top = -(\mathbf{I}_d - (\mathbf{B}_{t-1} + \bm{a}_{t-1}\bm{a}_{t-1}^\top)) = -\mathbf{\Psi}_{t-1}$.
- Coefficient of $1/t^2$: $-\frac{\mathbf{B}_{t-1}}{n} - \bm{a}_{t-1}\bm{a}_{t-1}^\top = -\left(\frac{\mathbf{B}_{t-1}}{n} + \bm{a}_{t-1}\bm{a}_{t-1}^\top\right) = -\mathbf{K}_{t-1}$.

Here, we define the matrices as follows:

- $\mathbf{\Psi}_{t-1} \triangleq \mathbf{I}_d - (\mathbf{B}_{t-1} + \bm{a}_{t-1}\bm{a}_{t-1}^\top)$ represents the first-order dissipation matrix.
- $\mathbf{K}_{t-1} \triangleq \frac{\mathbf{B}_{t-1}}{n} + \bm{a}_{t-1}\bm{a}_{t-1}^\top$ denotes the second-order perturbation matrix.

This yields the exact matrix recurrence:

$$\mathbb{E}[\bar{\bm{\Sigma}}_t \mid \mathcal{F}_{t-1}] = \bar{\bm{\Sigma}}_{t-1}^{1/2}\left(\mathbf{I}_d - \frac{\mathbf{\Psi}_{t-1}}{t} - \frac{\mathbf{K}_{t-1}}{t^2}\right)(\bar{\bm{\Sigma}}_{t-1}^{1/2})^\top \tag{105}$$
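The regrouping from Eq. (103) to Eq. (105) is pure algebra, so it can be checked exactly with rational arithmetic. The sketch below does this in the scalar case $d = 1$, where $\mathbf{B}_{t-1}$ and $\bm{a}_{t-1}\bm{a}_{t-1}^\top$ reduce to numbers `b` and `a2`; the function names and the test values are illustrative assumptions.

```python
from fractions import Fraction


def bracket_eq103(n: int, t: int, b: Fraction, a2: Fraction) -> Fraction:
    """Scalar version of the bracket in Eq. (103); a2 stands for a^2."""
    return (Fraction(t - 1, t)
            + Fraction(n - 1, n * t) * b
            + Fraction(t - 1, n * t * t) * b
            + Fraction(t - 1, t * t) * a2)


def bracket_eq105(n: int, t: int, b: Fraction, a2: Fraction) -> Fraction:
    """Scalar version of 1 - Psi/t - K/t^2 from Eq. (105)."""
    psi = 1 - (b + a2)   # first-order dissipation
    k = b / n + a2       # second-order perturbation
    return 1 - psi / t - k / (t * t)


# Exact comparison over a small grid of (n, t) with arbitrary rational moments.
b, a2 = Fraction(3, 10), Fraction(1, 25)
for n in (2, 5, 50):
    for t in (1, 2, 7, 100):
        assert bracket_eq103(n, t, b, a2) == bracket_eq105(n, t, b, a2)
print("Eq. (103) and Eq. (105) agree on all tested (n, t)")
```

Using `Fraction` avoids floating-point tolerance questions: the two expressions are equal as rational numbers, not merely close.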
##### Step 4: Covariance recursion and collapse under the local\-basin assumption\.
We now make explicit the local-basin assumption used in the proof. Assume that the initialization lies in the local basin of attraction of the target state $\mathbf{u}^\ast$, and that the trajectory remains in this local basin for all iterations. Under this assumption, the first-order dissipation matrix admits a uniform positive lower spectral bound along the trajectory. Namely, there exists a constant $\psi > 0$ such that

$$\lambda_{\min}(\mathbf{\Psi}_{t-1}) \geq \psi \qquad \text{for all } t \geq 1. \tag{106}$$

Recall the matrix recurrence established above:

$$\mathbb{E}\left[\bar{\bm{\Sigma}}_t \mid \mathcal{F}_{t-1}\right] = \bar{\bm{\Sigma}}_{t-1} - \frac{1}{t}\bar{\bm{\Sigma}}_{t-1}^{1/2}\mathbf{\Psi}_{t-1}\bar{\bm{\Sigma}}_{t-1}^{1/2} - \frac{1}{t^2}\bar{\bm{\Sigma}}_{t-1}^{1/2}\mathbf{K}_{t-1}\bar{\bm{\Sigma}}_{t-1}^{1/2}, \tag{107}$$

where $\mathbf{K}_{t-1} \succeq \mathbf{0}$.
Let

$$S_t \coloneqq \operatorname{Tr}(\bar{\bm{\Sigma}}_t). \tag{108}$$

Applying the trace operator to [Eq. 107](https://arxiv.org/html/2606.13732#A2.E107) and using linearity of conditional expectation, we obtain

$$\mathbb{E}[S_t \mid \mathcal{F}_{t-1}] = S_{t-1} - \frac{1}{t}\operatorname{Tr}\bigl(\bar{\bm{\Sigma}}_{t-1}\mathbf{\Psi}_{t-1}\bigr) - \frac{1}{t^2}\operatorname{Tr}\bigl(\bar{\bm{\Sigma}}_{t-1}\mathbf{K}_{t-1}\bigr). \tag{109}$$

Since $\bar{\bm{\Sigma}}_{t-1} \succeq \mathbf{0}$ and $\mathbf{K}_{t-1} \succeq \mathbf{0}$, the trace $\operatorname{Tr}(\bar{\bm{\Sigma}}_{t-1}\mathbf{K}_{t-1})$ in the last term is nonnegative; hence,

$$\mathbb{E}[S_t \mid \mathcal{F}_{t-1}] \leq S_{t-1} - \frac{1}{t}\operatorname{Tr}\bigl(\bar{\bm{\Sigma}}_{t-1}\mathbf{\Psi}_{t-1}\bigr). \tag{110}$$

Since both $\bar{\bm{\Sigma}}_{t-1}$ and $\mathbf{\Psi}_{t-1}$ are symmetric positive semidefinite, [Lemma 4](https://arxiv.org/html/2606.13732#Thmlemma4) gives

$$\operatorname{Tr}\bigl(\bar{\bm{\Sigma}}_{t-1}\mathbf{\Psi}_{t-1}\bigr) \geq \lambda_{\min}(\mathbf{\Psi}_{t-1})\,\operatorname{Tr}(\bar{\bm{\Sigma}}_{t-1}) = \lambda_{\min}(\mathbf{\Psi}_{t-1})\,S_{t-1}. \tag{111}$$

Combining this with [Eq. 106](https://arxiv.org/html/2606.13732#A2.E106), we obtain

$$\mathbb{E}[S_t \mid \mathcal{F}_{t-1}] \leq \left(1 - \frac{\psi}{t}\right)S_{t-1}, \qquad t \geq 1. \tag{112}$$
Since

$$\mathbf{\Psi}_{t-1} = \mathbf{I}_d - \bigl(\mathbf{B}_{t-1} + \bm{a}_{t-1}\bm{a}_{t-1}^\top\bigr), \tag{113}$$

with $\mathbf{B}_{t-1} \succeq \mathbf{0}$ and $\bm{a}_{t-1}\bm{a}_{t-1}^\top \succeq \mathbf{0}$, we also have $\mathbf{\Psi}_{t-1} \preceq \mathbf{I}_d$. Therefore,

$$0 < \psi \leq 1. \tag{114}$$
We now show that the covariance converges to the zero matrix almost surely. To avoid the degenerate boundary factor $1 - \psi$ at $k = 1$ when $\psi = 1$, we normalize the recursion starting from $k = 2$. Define

$$P_1 \coloneqq 1, \qquad P_t \coloneqq \prod_{k=2}^{t}\left(1 - \frac{\psi}{k}\right), \qquad t \geq 2. \tag{115}$$

By [Eq. 114](https://arxiv.org/html/2606.13732#A2.E114), each factor in [Eq. 115](https://arxiv.org/html/2606.13732#A2.E115) is strictly positive, so

$$P_t > 0 \qquad \text{for all } t \geq 1. \tag{116}$$

Moreover, by construction,

$$P_t = P_{t-1}\left(1 - \frac{\psi}{t}\right), \qquad t \geq 2. \tag{117}$$
We normalize $S_t$ by $P_t$ and define

$$M_t \coloneqq \frac{S_t}{P_t}, \qquad t \geq 1. \tag{118}$$

Since $P_t$ is deterministic and $S_t$ is $\mathcal{F}_t$-measurable, the process $\{M_t\}_{t \geq 1}$ is adapted to the filtration $\{\mathcal{F}_t\}_{t \geq 1}$. Furthermore, for every $t \geq 2$, combining [Eqs. 112](https://arxiv.org/html/2606.13732#A2.E112) and [117](https://arxiv.org/html/2606.13732#A2.E117) yields

\begin{align}
\mathbb{E}[M_t \mid \mathcal{F}_{t-1}] &= \frac{1}{P_t}\,\mathbb{E}[S_t \mid \mathcal{F}_{t-1}] \tag{119}\\
&\leq \frac{1}{P_t}\left(1 - \frac{\psi}{t}\right)S_{t-1} \tag{120}\\
&= \frac{S_{t-1}}{P_{t-1}} = M_{t-1}. \tag{121}
\end{align}

Since $S_t \geq 0$ and $P_t > 0$, we also have $M_t \geq 0$. Therefore, $\{M_t\}_{t \geq 1}$ is a nonnegative supermartingale.

By Doob's supermartingale convergence theorem, there exists an almost surely finite random variable $M_\infty$ such that

$$M_t \xrightarrow{a.s.} M_\infty < \infty. \tag{122}$$
Next, using the elementary inequality $1 - x \leq e^{-x}$, we obtain

\begin{align}
P_t &= \prod_{k=2}^{t}\left(1 - \frac{\psi}{k}\right) \tag{123}\\
&\leq \prod_{k=2}^{t}\exp\left(-\frac{\psi}{k}\right) \tag{124}\\
&= \exp\left(-\psi\sum_{k=2}^{t}\frac{1}{k}\right). \tag{125}
\end{align}

Since the harmonic sum satisfies

$$\sum_{k=2}^{t}\frac{1}{k} = \log t + \mathcal{O}(1), \qquad t \to \infty, \tag{126}$$

it follows that

$$P_t = \mathcal{O}(t^{-\psi}), \tag{127}$$

and in particular, since $\psi > 0$,

$$P_t \to 0 \qquad \text{as } t \to \infty. \tag{128}$$

Combining [Eqs. 122](https://arxiv.org/html/2606.13732#A2.E122) and [128](https://arxiv.org/html/2606.13732#A2.E128), we conclude

$$S_t = M_t P_t \xrightarrow{a.s.} 0. \tag{129}$$

Recalling that $S_t = \operatorname{Tr}(\bar{\bm{\Sigma}}_t)$, we obtain

$$\operatorname{Tr}(\bar{\bm{\Sigma}}_t) \xrightarrow{a.s.} 0. \tag{130}$$
Finally, since $\bar{\bm{\Sigma}}_t \succeq \mathbf{0}$ for all $t$, all eigenvalues of $\bar{\bm{\Sigma}}_t$ are nonnegative, and their sum equals $\operatorname{Tr}(\bar{\bm{\Sigma}}_t)$. Therefore, [Eq. 130](https://arxiv.org/html/2606.13732#A2.E130) implies that every eigenvalue converges to zero, and hence

$$\bar{\bm{\Sigma}}_t \xrightarrow{a.s.} \mathbf{0}. \tag{131}$$

The sharper recursion in [Eq. 112](https://arxiv.org/html/2606.13732#A2.E112) will be used again in [Subsection B.5](https://arxiv.org/html/2606.13732#A2.SS5) to derive the explicit power-law rate in [Theorem 2](https://arxiv.org/html/2606.13732#Thmtheorem2).
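The decay of the normalizing sequence $P_t$ can be checked numerically. The sketch below builds $P_t$ from Eq. (115), compares it with the exponential bound of Eq. (125), and prints the rescaled quantity $P_t\,t^{\psi}$, which should stay bounded if $P_t = \mathcal{O}(t^{-\psi})$. The value $\psi = 0.5$ and the function names are illustrative assumptions.

```python
import math


def p_sequence(psi: float, t_max: int):
    """Return [P_1, ..., P_{t_max}] with P_1 = 1 and P_t = P_{t-1} * (1 - psi/t)."""
    p = [1.0]
    for t in range(2, t_max + 1):
        p.append(p[-1] * (1.0 - psi / t))
    return p


def exp_bound(psi: float, t: int) -> float:
    """Upper bound exp(-psi * sum_{k=2}^{t} 1/k) from Eq. (125)."""
    return math.exp(-psi * sum(1.0 / k for k in range(2, t + 1)))


psi = 0.5  # illustrative value in (0, 1]
p = p_sequence(psi, 10_000)
for t in (10, 100, 1_000, 10_000):
    # Columns: t, P_t, exponential bound, rescaled P_t * t^psi
    print(t, p[t - 1], exp_bound(psi, t), p[t - 1] * t ** psi)
```

In the boundary case $\psi = 1$ the product telescopes to $P_t = 1/t$ exactly, which matches the $t^{-\psi}$ rate with constant 1.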
#### B\.4\.3 Wasserstein Distance
The limit of the Wasserstein distance is:

$$\lim_{t\to\infty}\mathbb{E}[\mathbb{W}_2^2] = \lim_{t\to\infty}\left(\mathbb{E}[\|\bar{\bm{\mu}}_t - \bar{\bm{\mu}}_0\|^2] + \mathbb{E}[\mathfrak{B}^2(\bar{\bm{\Sigma}}_t, \bar{\bm{\Sigma}}_0)]\right) \tag{132}$$

1. Mean Term: Since $\bar{\bm{\mu}}_t \to u^*$, the distance converges to the bias:

   $$\mathbb{E}[\|\bar{\bm{\mu}}_t - \bar{\bm{\mu}}_0\|^2] \to \|u^* - \bar{\bm{\mu}}_0\|^2 \tag{133}$$
2. Covariance Term: Since $\bar{\bm{\Sigma}}_t \to \mathbf{0}$, the Bures metric simplifies:

   $$\mathfrak{B}^2(\bar{\bm{\Sigma}}_t, \bar{\bm{\Sigma}}_0) = \text{Tr}\left(\bar{\bm{\Sigma}}_t + \bar{\bm{\Sigma}}_0 - 2\left(\bar{\bm{\Sigma}}_t^{1/2}\bar{\bm{\Sigma}}_0\bar{\bm{\Sigma}}_t^{1/2}\right)^{1/2}\right) \to \text{Tr}(\mathbf{0} + \bar{\bm{\Sigma}}_0 - \mathbf{0}) = \text{Tr}(\bar{\bm{\Sigma}}_0) \tag{134}$$

Thus, the total Wasserstein distance converges to:

$$\lim_{t\to\infty}\mathbb{E}[\mathbb{W}_2^2] = \|u^* - \bar{\bm{\mu}}_0\|^2 + \text{Tr}(\bar{\bm{\Sigma}}_0) \tag{135}$$

∎
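The qualitative picture established in this proof (mean pulled toward $u^*$, variance shrinking) can be illustrated with a small one-dimensional simulation of the Accumulate paradigm. This is a hedged sketch, not the paper's experimental setup: it assumes $d = 1$, approximates the region $\mathcal{R}_t$ by keeping the $n$ candidates closest to $u^*$ out of $n/\alpha$ raw draws, and uses illustrative values for all parameters. The function name `simulate_accumulate_with_selection` is hypothetical.

```python
import random
import statistics


def simulate_accumulate_with_selection(generations=30, n=200, alpha=0.5,
                                       mu0=0.0, sigma0=1.0, u_star=0.5, seed=0):
    """1-D sketch of Accumulate training with top-alpha selection.

    Each generation draws round(n / alpha) samples from the current Gaussian
    fit, keeps the n samples closest to u_star (a stand-in for the region R_t),
    appends them to the pool, and refits mean and variance on the whole pool.
    Returns the final mean and the list of pooled variances per generation.
    """
    rng = random.Random(seed)
    pool = []
    mu, sigma = mu0, sigma0
    variances = []
    for _ in range(generations):
        m = int(round(n / alpha))
        raw = [rng.gauss(mu, sigma) for _ in range(m)]
        raw.sort(key=lambda x: abs(x - u_star))   # highest utility first
        pool.extend(raw[:n])                      # accumulate selected samples
        mu = statistics.fmean(pool)
        sigma = statistics.pstdev(pool)
        variances.append(sigma ** 2)
    return mu, variances


mu, variances = simulate_accumulate_with_selection()
# In 1-D the squared Bures term reduces to (sigma_t - sigma_0)^2.
w2_squared = (mu - 0.0) ** 2 + (variances[-1] ** 0.5 - 1.0) ** 2
print(mu, variances[0], variances[-1], w2_squared)
```

Over more generations the pooled variance keeps decreasing, so the covariance part of the distance approaches $\text{Tr}(\bar{\bm{\Sigma}}_0)$, consistent with Eq. (135); the finite-horizon numbers printed here are only indicative.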
### B\.5 Proof of [Theorem 2](https://arxiv.org/html/2606.13732#Thmtheorem2)
See [Theorem 2](https://arxiv.org/html/2606.13732#Thmtheorem2)
###### Proof\.
We study the asymptotic decay rate of the covariance trace
St≔Tr\(𝚺¯t\)\.S\_\{t\}\\coloneqq\\text\{Tr\}\(\\bar\{\\bm\{\\Sigma\}\}\_\{t\}\)\.\(136\)By construction,St≥0S\_\{t\}\\geq 0for alltt\.
From[Subsection˜B\.4](https://arxiv.org/html/2606.13732#A2.SS4), under the local\-basin assumption there exists a constantψ\>0\\psi\>0such that
𝔼\[St∣ℱt−1\]≤\(1−ψt\)St−1,t≥2,\\mathbb\{E\}\[S\_\{t\}\\mid\\mathcal\{F\}\_\{t\-1\}\]\\leq\\left\(1\-\\frac\{\\psi\}\{t\}\\right\)S\_\{t\-1\},\\qquad t\\geq 2,\(137\)with
0<ψ≤1\.0<\\psi\\leq 1\.\(138\)
To absorb the multiplicative contraction factor in [Eq. 137](https://arxiv.org/html/2606.13732#A2.E137), define the deterministic sequence
$$P_{1}\coloneqq 1,\qquad P_{t}\coloneqq\prod_{k=2}^{t}\left(1-\frac{\psi}{k}\right),\qquad t\geq 2. \tag{139}$$
By [Eq. 138](https://arxiv.org/html/2606.13732#A2.E138), every factor in [Eq. 139](https://arxiv.org/html/2606.13732#A2.E139) is strictly positive (indeed at least $1/2$, since $\psi/k\leq 1/2$ for $k\geq 2$), so
$$P_{t}>0\qquad\text{for all }t\geq 1. \tag{140}$$
Moreover, by construction,
$$P_{t}=P_{t-1}\left(1-\frac{\psi}{t}\right),\qquad t\geq 2. \tag{141}$$
We now normalize $S_{t}$ by $P_{t}$ and define
$$M_{t}\coloneqq\frac{S_{t}}{P_{t}},\qquad t\geq 1. \tag{142}$$
Since $P_{t}$ is deterministic and $S_{t}$ is $\mathcal{F}_{t}$-measurable, the process $\{M_{t}\}_{t\geq 1}$ is adapted to the filtration $\{\mathcal{F}_{t}\}_{t\geq 1}$. Furthermore, for every $t\geq 2$, combining [Eqs. 137](https://arxiv.org/html/2606.13732#A2.E137) and [141](https://arxiv.org/html/2606.13732#A2.E141) yields
\begin{align}
\mathbb{E}[M_{t}\mid\mathcal{F}_{t-1}] &=\frac{1}{P_{t}}\,\mathbb{E}[S_{t}\mid\mathcal{F}_{t-1}] \tag{143}\\
&\leq\frac{1}{P_{t}}\left(1-\frac{\psi}{t}\right)S_{t-1} \tag{144}\\
&=\frac{S_{t-1}}{P_{t-1}}=M_{t-1}. \tag{145}
\end{align}
Since $S_{t}\geq 0$ and $P_{t}>0$, we also have $M_{t}\geq 0$. Therefore, $\{M_{t}\}_{t\geq 1}$ is a nonnegative supermartingale.
By Doob's supermartingale convergence theorem, there exists an almost surely finite random variable $M_{\infty}$ such that
$$M_{t}\xrightarrow{a.s.}M_{\infty}<\infty. \tag{146}$$
In particular,
$$M_{t}=\mathcal{O}_{a.s.}(1). \tag{147}$$
It remains to estimate the deterministic factor $P_{t}$. Using the elementary inequality $1-x\leq e^{-x}$, we obtain
\begin{align}
P_{t} &=\prod_{k=2}^{t}\left(1-\frac{\psi}{k}\right) \tag{148}\\
&\leq\prod_{k=2}^{t}\exp\left(-\frac{\psi}{k}\right) \tag{149}\\
&=\exp\left(-\psi\sum_{k=2}^{t}\frac{1}{k}\right). \tag{150}
\end{align}
Since the harmonic sum satisfies
$$\sum_{k=2}^{t}\frac{1}{k}=\log t+\mathcal{O}(1),\qquad t\to\infty, \tag{151}$$
it follows that
$$P_{t}=\mathcal{O}(t^{-\psi}). \tag{152}$$
Finally, combining [Eqs. 142](https://arxiv.org/html/2606.13732#A2.E142), [147](https://arxiv.org/html/2606.13732#A2.E147) and [152](https://arxiv.org/html/2606.13732#A2.E152), we obtain
$$S_{t}=M_{t}P_{t}=\mathcal{O}_{a.s.}(t^{-\psi}). \tag{153}$$
Recalling the definition of $S_{t}$, we conclude that
$$\text{Tr}(\bar{\bm{\Sigma}}_{t})=\mathcal{O}_{a.s.}(t^{-\psi}), \tag{154}$$
which proves [Theorem 2](https://arxiv.org/html/2606.13732#Thmtheorem2). ∎
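The deterministic part of this argument, the bound $P_{t}=\mathcal{O}(t^{-\psi})$, is easy to check numerically. The sketch below is illustrative only: the helper name `contraction_product` and the chosen values of $\psi$ and $t$ are assumptions, not quantities from the paper. It relies on the elementary estimate $\sum_{k=2}^{t}1/k\geq\log((t+1)/2)$, which gives $P_{t}\,t^{\psi}\leq 2^{\psi}$.

```python
def contraction_product(psi, t):
    """P_t = prod_{k=2}^t (1 - psi/k), with the convention P_1 = 1."""
    p = 1.0
    for k in range(2, t + 1):
        p *= 1.0 - psi / k
    return p

psi = 0.5  # hypothetical contraction constant in (0, 1]
for t in (10, 100, 1000, 10000):
    p_t = contraction_product(psi, t)
    # The rescaled value p_t * t**psi should stay bounded (by 2**psi).
    print(t, p_t, p_t * t ** psi)
```

The rescaled column staying bounded is exactly the statement that the covariance trace can decay no slower than a power law with exponent $\psi$, up to the almost surely bounded factor $M_{t}$.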
### B.6 Proof of [Theorem 3](https://arxiv.org/html/2606.13732#Thmtheorem3)
See [Theorem 3](https://arxiv.org/html/2606.13732#Thmtheorem3).
###### Proof.
Let $\mathcal{L}:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}^{+}$ denote the loss function. Our objective is to bound the expected risk on the oracle distribution, $\mathcal{R}_{\mathcal{D}^{*}}(h_{t};g^{*})$, by relating it to the risk on the filtered synthetic distribution, $\mathcal{R}_{\mathcal{D}_{t}}(h_{t};g_{t})$, plus additional terms accounting for the distributional shift.
Step 1: Primal Formulation via Optimal Coupling
We begin by expressing the risks as expectations. By definition, the risk on the oracle distribution $\mathcal{D}^{*}$ is:
$$\mathcal{R}_{\mathcal{D}^{*}}(h_{t};g^{*})=\mathbb{E}_{x^{*}\sim\mathcal{D}^{*}}\left[\mathcal{L}(h_{t}(x^{*}),g^{*}(x^{*}))\right]. \tag{155}$$
Analogously, the risk on the filtered synthetic distribution $\mathcal{D}_{t}$ is:
$$\mathcal{R}_{\mathcal{D}_{t}}(h_{t};g_{t})=\mathbb{E}_{x_{t}\sim\mathcal{D}_{t}}\left[\mathcal{L}(h_{t}(x_{t}),g_{t}(x_{t}))\right]. \tag{156}$$
To compare these two expectations directly, we use the theory of optimal transport. Let $\pi\in\Pi(\mathcal{D}_{t},\mathcal{D}^{*})$ be the optimal coupling that minimizes the transport cost, thereby realizing the $p$-Wasserstein distance $\mathbb{W}_{p}(\mathcal{D}_{t},\mathcal{D}^{*})$ with respect to the metric $d_{\mathcal{X}}$:
$$\mathbb{W}_{p}^{p}(\mathcal{D}_{t},\mathcal{D}^{*})=\int_{\mathcal{X}\times\mathcal{X}}d_{\mathcal{X}}^{p}(x_{t},x^{*})\,\mathrm{d}\pi(x_{t},x^{*}). \tag{157}$$
Because the marginals of $\pi$ are $\mathcal{D}_{t}$ and $\mathcal{D}^{*}$, both expectations can be written as integrals against $\pi$. We can therefore rewrite the absolute difference in risks, denoted by $\Delta$, as a single integral over the product space $\mathcal{X}\times\mathcal{X}$:
\begin{align}
\Delta &\coloneqq\left|\mathcal{R}_{\mathcal{D}^{*}}(h_{t};g^{*})-\mathcal{R}_{\mathcal{D}_{t}}(h_{t};g_{t})\right| \tag{158}\\
&=\left|\int_{\mathcal{X}\times\mathcal{X}}\left(\mathcal{L}(h_{t}(x^{*}),g^{*}(x^{*}))-\mathcal{L}(h_{t}(x_{t}),g_{t}(x_{t}))\right)\,\mathrm{d}\pi(x_{t},x^{*})\right|. \notag
\end{align}
By invoking the integral triangle inequality, $\left|\int_{\mathcal{X}}f(x)\,dx\right|\leq\int_{\mathcal{X}}|f(x)|\,dx$, we derive the following upper bound on the risk gap:
$$\Delta\leq\int_{\mathcal{X}\times\mathcal{X}}\left|\mathcal{L}(h_{t}(x^{*}),g^{*}(x^{*}))-\mathcal{L}(h_{t}(x_{t}),g_{t}(x_{t}))\right|\,\mathrm{d}\pi(x_{t},x^{*}). \tag{159}$$
Step 2: Pointwise Decomposition
To isolate the sources of error, we perform a pointwise decomposition of the integrand. For any coupled pair $(x_{t},x^{*})\in\mathcal{X}\times\mathcal{X}$, we insert an intermediate term $\mathcal{L}(h_{t}(x_{t}),g^{*}(x^{*}))$ and apply the triangle inequality. This splits the total error into two distinct components: the model's geometric sensitivity and the target's alignment shift.
\begin{align}
&\left|\mathcal{L}(h_{t}(x^{*}),g^{*}(x^{*}))-\mathcal{L}(h_{t}(x_{t}),g_{t}(x_{t}))\right| \tag{160}\\
&\leq\underbrace{\left|\mathcal{L}(h_{t}(x^{*}),g^{*}(x^{*}))-\mathcal{L}(h_{t}(x_{t}),g^{*}(x^{*}))\right|}_{\text{Term (I): Hypothesis Variation}}+\underbrace{\left|\mathcal{L}(h_{t}(x_{t}),g^{*}(x^{*}))-\mathcal{L}(h_{t}(x_{t}),g_{t}(x_{t}))\right|}_{\text{Term (II): Target Shift}}. \notag
\end{align}
Step 3: Bounding Term \(I\) \(Hypothesis Stability\)
We first bound Term (I), which represents the stability of the hypothesis $h_{t}$ against input perturbations. Based on [Assumption 2](https://arxiv.org/html/2606.13732#Thmassumption2), the loss $\mathcal{L}$ is $\ell$-Lipschitz in its first argument, and the hypothesis $h_{t}$ is $\epsilon$-Lipschitz. Chaining these properties yields:
$$\text{Term (I)}\leq\ell\cdot d_{\mathcal{Y}}(h_{t}(x^{*}),h_{t}(x_{t}))\leq\ell\cdot\epsilon\cdot d_{\mathcal{X}}(x^{*},x_{t}). \tag{161}$$
Integrating this bound over the coupling $\pi$:
$$\int_{\mathcal{X}\times\mathcal{X}}\text{Term (I)}\,\mathrm{d}\pi(x_{t},x^{*})\leq\ell\epsilon\int_{\mathcal{X}\times\mathcal{X}}d_{\mathcal{X}}(x^{*},x_{t})\,\mathrm{d}\pi(x_{t},x^{*}). \tag{162}$$
Step 4: Bounding Term \(II\) \(Target Shift\)
Next, we bound Term (II), which captures the alignment gap between the oracle target $g^{*}$ and the local verification target $g_{t}$. We partition the integration domain into two regions: an "aligned set" $S$ and a "misaligned set" $S^{c}$, where $\pi(S^{c})\leq\delta$.
- On the aligned set $S$: The targets satisfy the probabilistic Lipschitz condition. Using the $\ell$-Lipschitz property of the loss in its second argument:
  $$\text{Term (II)}\cdot\bm{1}_{S}\leq\ell\cdot d_{\mathcal{Y}}(g^{*}(x^{*}),g_{t}(x_{t}))\leq\ell\epsilon\cdot d_{\mathcal{X}}(x^{*},x_{t}). \tag{163}$$
- On the misaligned set $S^{c}$: We employ the worst-case bound using the finite diameter $C_{\mathcal{Y}}\triangleq\sup_{y,y^{\prime}}d_{\mathcal{Y}}(y,y^{\prime})$ of the output space.
  $$\text{Term (II)}\cdot\bm{1}_{S^{c}}\leq\ell\cdot C_{\mathcal{Y}}. \tag{164}$$
  Remark on the Diameter $C_{\mathcal{Y}}$: The boundedness of $C_{\mathcal{Y}}$ is intrinsic to our task-specific ground metrics defined in [Subsection C.1](https://arxiv.org/html/2606.13732#A3.SS1). For instance, in language modeling tasks where the output space $\mathcal{Y}$ is the probability simplex $\Delta^{V-1}$ over a vocabulary of size $V$, equipped with the $L_{1}$ distance (a special case of WMD with discrete topology), the diameter is at most 2. Specifically, for any two distributions $P_{a},P_{b}\in\Delta^{V-1}$ with entries $p_{i}$ and $q_{i}$, respectively:
  $$\|P_{a}-P_{b}\|_{1}=\sum_{i=1}^{V}|p_{i}-q_{i}|\leq\sum_{i=1}^{V}|p_{i}|+\sum_{i=1}^{V}|q_{i}|=1+1=2. \tag{165}$$
  Equality holds when the supports of $P_{a}$ and $P_{b}$ are disjoint (e.g., distinct one-hot vectors).
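The diameter remark can be illustrated with a few lines of code. This is a minimal sketch with made-up three-token distributions; `l1_distance` is a hypothetical helper, not part of the paper's pipeline.

```python
def l1_distance(p, q):
    """L1 distance between two probability vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))

one_hot_a = [1.0, 0.0, 0.0]
one_hot_b = [0.0, 1.0, 0.0]
mixed = [0.2, 0.5, 0.3]

# Disjoint supports attain the diameter; overlapping supports fall below it.
print(l1_distance(one_hot_a, one_hot_b))
print(l1_distance(one_hot_a, mixed))
```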
Integrating Term (II) over the partitioned domain yields:
$$\int_{\mathcal{X}\times\mathcal{X}}\text{Term (II)}\,\mathrm{d}\pi(x_{t},x^{*})\leq\ell\epsilon\int_{\mathcal{X}\times\mathcal{X}}d_{\mathcal{X}}(x^{*},x_{t})\,\mathrm{d}\pi(x_{t},x^{*})+\ell C_{\mathcal{Y}}\delta. \tag{166}$$
Step 5: Final Aggregation and Bound
Substituting the results from [Eq. 162](https://arxiv.org/html/2606.13732#A2.E162) and [Eq. 166](https://arxiv.org/html/2606.13732#A2.E166) back into [Eq. 159](https://arxiv.org/html/2606.13732#A2.E159), we aggregate the linear terms with respect to $d_{\mathcal{X}}$. Note that both Term (I) and Term (II) contribute an $\ell\epsilon$ factor:
$$\Delta\leq\int_{\mathcal{X}\times\mathcal{X}}(\ell\epsilon+\ell\epsilon)\cdot d_{\mathcal{X}}(x^{*},x_{t})\,\mathrm{d}\pi(x_{t},x^{*})+\ell C_{\mathcal{Y}}\delta. \tag{167}$$
Finally, to relate the expected distance $\int d_{\mathcal{X}}\,\mathrm{d}\pi$ to the $p$-Wasserstein metric $\mathbb{W}_{p}$, we apply Jensen's inequality, which is valid for $p\geq 1$:
$$\int_{\mathcal{X}\times\mathcal{X}}d_{\mathcal{X}}(x_{t},x^{*})\,\mathrm{d}\pi(x_{t},x^{*})\leq\left(\int_{\mathcal{X}\times\mathcal{X}}d_{\mathcal{X}}^{p}(x_{t},x^{*})\,\mathrm{d}\pi(x_{t},x^{*})\right)^{1/p}=\mathbb{W}_{p}(\mathcal{D}_{t},\mathcal{D}^{*}). \tag{168}$$
Rearranging terms and absorbing the constant $C_{\mathcal{Y}}$ that multiplies $\delta$ into the big-O notation, we conclude the proof:
$$\mathcal{R}_{\mathcal{D}^{*}}(h_{t};g^{*})\leq\mathcal{R}_{\mathcal{D}_{t}}(h_{t};g_{t})+2\ell\epsilon\cdot\mathbb{W}_{p}(\mathcal{D}_{t},\mathcal{D}^{*})+\mathcal{O}(\ell\delta). \tag{169}$$
∎
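A small one-dimensional experiment illustrates the bound in Eq. 169. The sketch below is not the paper's setup: it assumes $p=1$, a shared target ($g_{t}=g^{*}$, so $\delta=0$), the absolute-error loss (1-Lipschitz in each argument), and hypothetical choices `h` and `g` that are both `EPS`-Lipschitz. In one dimension, the sorted coupling gives the exact $\mathbb{W}_{1}$ between equal-size empirical measures.

```python
import math
import random

def w1_empirical(xs, ys):
    """W1 between equal-size 1D empirical measures via the sorted (monotone) coupling."""
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

EPS = 0.7  # assumed Lipschitz constant of the hypothesis and of the target
ELL = 1.0  # |a - b| is 1-Lipschitz in each argument

def h(x):
    return EPS * math.sin(x)  # hypothetical hypothesis h_t

def g(x):
    return EPS * math.cos(x)  # hypothetical shared target g* = g_t

def risk(xs):
    """Empirical risk with the absolute-error loss."""
    return sum(abs(h(x) - g(x)) for x in xs) / len(xs)

rng = random.Random(0)
oracle = [rng.gauss(0.0, 1.0) for _ in range(500)]    # samples standing in for D*
filtered = [rng.gauss(0.3, 0.5) for _ in range(500)]  # samples standing in for D_t

gap = abs(risk(oracle) - risk(filtered))
bound = 2 * ELL * EPS * w1_empirical(filtered, oracle)
print(gap, bound)
```

With a shifted and narrower filtered sample, the risk gap stays below $2\ell\epsilon\,\mathbb{W}_{1}$, as the proof predicts when no mass lies in the misaligned set.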
### B.7 Proof of [Theorem 4](https://arxiv.org/html/2606.13732#Thmtheorem4)
See [Theorem 4](https://arxiv.org/html/2606.13732#Thmtheorem4).
###### Proof.
The proof relies on the geometric properties of Wasserstein space, specifically geodesics and the triangle inequality, following the proof strategy of Theorem 2 in (Rakotomamonjy et al., [2024](https://arxiv.org/html/2606.13732#bib.bib29)). We show that the sequence $\mathcal{E}_{k}^{(r)}$ is non-increasing and converges to $\mathcal{W}_{p}(\mathcal{Q}_{k},\mathcal{P})$.
Recall the definition of the objective sequence at iteration $r$ (formulated in terms of the metric $\mathcal{W}_{p}$ to satisfy the triangle inequality conditions):
$$\mathcal{E}_{k}^{(r)}=\mathcal{W}_{p}(\mathcal{Q}_{k},\xi_{k}^{(r)})+\mathcal{W}_{p}(\xi_{k}^{(r)},\mathcal{P}). \tag{170}$$
1. Decomposition via Geodesic Interpolants. In the interpolation step, $\xi_{\mathcal{P}}^{(r)}$ is computed as the interpolating measure between $\mathcal{P}$ and the current proxy $\xi_{k}^{(r)}$, and $\xi_{\mathcal{Q}_{k}}^{(r)}$ is the interpolating measure between $\mathcal{Q}_{k}$ and $\xi_{k}^{(r)}$. According to [Property 3](https://arxiv.org/html/2606.13732#Thmproperty3) and the definition of interpolating measures in (Rakotomamonjy et al., [2024](https://arxiv.org/html/2606.13732#bib.bib29)), these points satisfy the equality case of the triangle inequality:
\begin{align}
\mathcal{W}_{p}(\mathcal{P},\xi_{k}^{(r)}) &=\mathcal{W}_{p}(\mathcal{P},\xi_{\mathcal{P}}^{(r)})+\mathcal{W}_{p}(\xi_{\mathcal{P}}^{(r)},\xi_{k}^{(r)}), \tag{171}\\
\mathcal{W}_{p}(\mathcal{Q}_{k},\xi_{k}^{(r)}) &=\mathcal{W}_{p}(\mathcal{Q}_{k},\xi_{\mathcal{Q}_{k}}^{(r)})+\mathcal{W}_{p}(\xi_{\mathcal{Q}_{k}}^{(r)},\xi_{k}^{(r)}). \tag{172}
\end{align}
2. Monotonicity Analysis. Consider the next iteration $r+1$. The new proxy $\xi_{k}^{(r+1)}$ is constructed as the interpolating measure between the intermediates $\xi_{\mathcal{P}}^{(r)}$ and $\xi_{\mathcal{Q}_{k}}^{(r)}$. We start by applying the triangle inequality to the new terms in $\mathcal{E}_{k}^{(r+1)}$:
\begin{align}
\mathcal{W}_{p}(\mathcal{P},\xi_{k}^{(r+1)}) &\leq\mathcal{W}_{p}(\mathcal{P},\xi_{\mathcal{P}}^{(r)})+\mathcal{W}_{p}(\xi_{\mathcal{P}}^{(r)},\xi_{k}^{(r+1)}), \tag{173}\\
\mathcal{W}_{p}(\mathcal{Q}_{k},\xi_{k}^{(r+1)}) &\leq\mathcal{W}_{p}(\mathcal{Q}_{k},\xi_{\mathcal{Q}_{k}}^{(r)})+\mathcal{W}_{p}(\xi_{\mathcal{Q}_{k}}^{(r)},\xi_{k}^{(r+1)}). \tag{174}
\end{align}
Summing these inequalities gives:
$$\mathcal{E}_{k}^{(r+1)}\leq\mathcal{W}_{p}(\mathcal{P},\xi_{\mathcal{P}}^{(r)})+\mathcal{W}_{p}(\mathcal{Q}_{k},\xi_{\mathcal{Q}_{k}}^{(r)})+\left[\mathcal{W}_{p}(\xi_{\mathcal{P}}^{(r)},\xi_{k}^{(r+1)})+\mathcal{W}_{p}(\xi_{k}^{(r+1)},\xi_{\mathcal{Q}_{k}}^{(r)})\right]. \tag{175}$$
Since $\xi_{k}^{(r+1)}$ is the interpolant between $\xi_{\mathcal{P}}^{(r)}$ and $\xi_{\mathcal{Q}_{k}}^{(r)}$, it lies on the geodesic connecting them. Thus, the term in brackets equals the distance between the intermediates:
$$\mathcal{W}_{p}(\xi_{\mathcal{P}}^{(r)},\xi_{k}^{(r+1)})+\mathcal{W}_{p}(\xi_{k}^{(r+1)},\xi_{\mathcal{Q}_{k}}^{(r)})=\mathcal{W}_{p}(\xi_{\mathcal{P}}^{(r)},\xi_{\mathcal{Q}_{k}}^{(r)}). \tag{176}$$
Furthermore, applying the triangle inequality to the previous proxy $\xi_{k}^{(r)}$ (which may not lie on the direct geodesic between the two new intermediates) yields:
$$\mathcal{W}_{p}(\xi_{\mathcal{P}}^{(r)},\xi_{\mathcal{Q}_{k}}^{(r)})\leq\mathcal{W}_{p}(\xi_{\mathcal{P}}^{(r)},\xi_{k}^{(r)})+\mathcal{W}_{p}(\xi_{k}^{(r)},\xi_{\mathcal{Q}_{k}}^{(r)}). \tag{177}$$
Substituting these results back into [Eq. 175](https://arxiv.org/html/2606.13732#A2.E175):
\begin{align}
\mathcal{E}_{k}^{(r+1)} &\leq\mathcal{W}_{p}(\mathcal{P},\xi_{\mathcal{P}}^{(r)})+\mathcal{W}_{p}(\mathcal{Q}_{k},\xi_{\mathcal{Q}_{k}}^{(r)})+\mathcal{W}_{p}(\xi_{\mathcal{P}}^{(r)},\xi_{k}^{(r)})+\mathcal{W}_{p}(\xi_{k}^{(r)},\xi_{\mathcal{Q}_{k}}^{(r)}) \notag\\
&=\left(\mathcal{W}_{p}(\mathcal{P},\xi_{\mathcal{P}}^{(r)})+\mathcal{W}_{p}(\xi_{\mathcal{P}}^{(r)},\xi_{k}^{(r)})\right)+\left(\mathcal{W}_{p}(\mathcal{Q}_{k},\xi_{\mathcal{Q}_{k}}^{(r)})+\mathcal{W}_{p}(\xi_{\mathcal{Q}_{k}}^{(r)},\xi_{k}^{(r)})\right). \tag{178}
\end{align}
Using the decompositions from [Eq. 171](https://arxiv.org/html/2606.13732#A2.E171) and [Eq. 172](https://arxiv.org/html/2606.13732#A2.E172), the right-hand side is exactly $\mathcal{E}_{k}^{(r)}$. Thus:
$$\mathcal{E}_{k}^{(r+1)}\leq\mathcal{E}_{k}^{(r)}. \tag{179}$$
This confirms that the sequence $\{\mathcal{E}_{k}^{(r)}\}_{r\geq 0}$ is non-increasing.
3. Convergence. By the triangle inequality, for any $\xi$, we have $\mathcal{W}_{p}(\mathcal{P},\mathcal{Q}_{k})\leq\mathcal{W}_{p}(\mathcal{P},\xi)+\mathcal{W}_{p}(\xi,\mathcal{Q}_{k})$. Therefore, the sequence is bounded below by the true transport cost:
$$\mathcal{E}_{k}^{(r)}\geq\mathcal{W}_{p}(\mathcal{P},\mathcal{Q}_{k}). \tag{180}$$
Since $\{\mathcal{E}_{k}^{(r)}\}$ is non-increasing and bounded below, it converges to its infimum by the Monotone Convergence Theorem. As shown in (Rakotomamonjy et al., [2024](https://arxiv.org/html/2606.13732#bib.bib29)), as $r\to\infty$, the iterative interpolation effectively approximates the geodesic path, and the limit satisfies:
$$\lim_{r\to\infty}\mathcal{E}_{k}^{(r)}=\mathcal{W}_{p}(\mathcal{Q}_{k},\mathcal{P}). \tag{181}$$
∎
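The monotone decrease of $\mathcal{E}_{k}^{(r)}$ can be observed in a toy setting. The sketch below makes several simplifying assumptions that are not from the paper: measures are one-dimensional, uniform, and of equal size (so Wasserstein geodesics reduce to linear interpolation of sorted samples), $p=2$, and every interpolant is taken at the midpoint `t=0.5`. All sample values are made up.

```python
import math

def w2(xs, ys):
    """W2 between equal-size 1D uniform empirical measures given as sorted lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs))

def interpolate(xs, ys, t=0.5):
    """Point at fraction t along the 1D Wasserstein geodesic from xs to ys (both sorted)."""
    return [(1 - t) * a + t * b for a, b in zip(xs, ys)]

def objective(q, xi, p):
    """E = W(Q, xi) + W(xi, P), the quantity tracked in the proof."""
    return w2(q, xi) + w2(xi, p)

P = sorted([0.0, 1.0, 2.0, 5.0])
Q = sorted([3.0, 3.5, 4.0, 9.0])
xi = sorted([-4.0, 0.0, 6.0, 20.0])  # arbitrary initial proxy

history = [objective(Q, xi, P)]
for _ in range(60):
    xi_p = interpolate(P, xi)       # intermediate between P and the current proxy
    xi_q = interpolate(Q, xi)       # intermediate between Q and the current proxy
    xi = interpolate(xi_p, xi_q)    # new proxy between the two intermediates
    history.append(objective(Q, xi, P))

print(history[0], history[-1], w2(Q, P))
```

In this flat one-dimensional setting the proxy is pulled onto the geodesic between the two measures, so the objective decreases toward the direct distance between them.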
### B.8 Proof of [Theorem 5](https://arxiv.org/html/2606.13732#Thmtheorem5)
See [Theorem 5](https://arxiv.org/html/2606.13732#Thmtheorem5).
###### Proof.
The proof leverages the Majorization–Minimization (MM) framework, which guarantees a monotonic decrease in the objective function by optimizing a surrogate function. We operate in the Wasserstein space $\mathcal{W}_{2}$, where the Fréchet variance allows the construction of a quadratic surrogate despite the lack of a global triangle inequality for the squared metric.
1. Notation and Sequence Definition. The objective function to minimize is the weighted Fréchet variance, defined in the continuous space as:
$$\mathcal{E}(\xi)=\sum_{k=1}^{K}\lambda_{k}\mathcal{W}_{2}^{2}(\mathcal{Q}_{k},\xi)=\sum_{k=1}^{K}\lambda_{k}\inf_{\pi_{k}\in\Pi(\mathcal{Q}_{k},\xi)}\int_{\mathcal{X}\times\mathcal{X}}\|\mathbf{x}-\mathbf{y}\|^{2}\,d\pi_{k}(\mathbf{x},\mathbf{y}), \tag{182}$$
where $K$ is the total number of target clients, and $\lambda_{k}>0$ denotes the aggregation weight for client $k$ such that $\sum_{k=1}^{K}\lambda_{k}=1$.
To keep the indices manageable in the discrete setting, we fix the following notation for the transition from iteration $r-1$ to $r$:
- Let $\mathcal{Q}_{k}$ be the empirical target distribution of the $k$-th client. It is supported on $M_{k}$ discrete data points, denoted as the vectors $\{\mathbf{x}_{k,j}\}_{j=1}^{M_{k}}$.
- Let $S$ denote the fixed number of support points used to approximate the global proxy distribution.
- Let $\xi^{(r-1)}$ be the current global proxy. It is characterized by $S$ discrete support points $\{\mathbf{y}_{i}\}_{i=1}^{S}$ and their corresponding strictly positive probability weights $\{w_{i}\}_{i=1}^{S}$, where $\sum_{i=1}^{S}w_{i}=1$.
- Let $\xi^{(r)}$ be the updated global proxy at the next iteration, supported on the updated coordinate vectors $\{\mathbf{y}_{i}^{+}\}_{i=1}^{S}$ with the identical probability weights $\{w_{i}\}_{i=1}^{S}$.
2. Step 1: Majorization via Fixed Transport Plans. Let $\mathbf{P}_{k}^{(r-1)}$ denote the optimal coupling matrix between the current proxy $\xi^{(r-1)}$ and the client data $\mathcal{Q}_{k}$. The element $P_{k,ij}^{(r-1)}$ represents the probability mass transported from $\mathbf{y}_{i}$ to $\mathbf{x}_{k,j}$. Mass conservation dictates the marginal constraint $\sum_{j=1}^{M_{k}}P_{k,ij}^{(r-1)}=w_{i}$.
For any candidate distribution $\xi$ characterized by variable support points $\{\mathbf{z}_{i}\}_{i=1}^{S}$ with the same weights $\{w_{i}\}_{i=1}^{S}$, we construct the surrogate upper-bound objective $\mathcal{U}$ by fixing these optimal couplings:
$$\mathcal{U}(\xi,\xi^{(r-1)})\triangleq\sum_{k=1}^{K}\lambda_{k}\sum_{i=1}^{S}\sum_{j=1}^{M_{k}}P_{k,ij}^{(r-1)}\|\mathbf{x}_{k,j}-\mathbf{z}_{i}\|^{2}. \tag{183}$$
Because the weights are unchanged, $\mathbf{P}_{k}^{(r-1)}$ remains a feasible coupling between $\xi$ and $\mathcal{Q}_{k}$. Since $\mathcal{W}_{2}^{2}$ is defined as the infimum over all feasible couplings, evaluating the cost using the fixed coupling $\mathbf{P}_{k}^{(r-1)}$ yields a valid majorization bound:
$$\mathcal{E}(\xi)\leq\mathcal{U}(\xi,\xi^{(r-1)}),\quad\forall\xi. \tag{184}$$
Evaluated at the current proxy $\xi^{(r-1)}$ (where $\mathbf{z}_{i}=\mathbf{y}_{i}$), the bound is tight because the couplings are optimal for this state: $\mathcal{E}(\xi^{(r-1)})=\mathcal{U}(\xi^{(r-1)},\xi^{(r-1)})$.
3. Step 2: Minimization via Barycentric Interpolation. To isolate the optimization variables $\mathbf{z}_{i}$, we define the barycentric projection $\mathbf{\hat{x}}_{k,i}$. This represents the weighted center of the target data in $\mathcal{Q}_{k}$ that is mapped to the $i$-th proxy point:
$$\mathbf{\hat{x}}_{k,i}=\frac{1}{w_{i}}\sum_{j=1}^{M_{k}}P_{k,ij}^{(r-1)}\mathbf{x}_{k,j}. \tag{185}$$
To reveal the strict convexity of the surrogate, we apply the variance decomposition theorem (parallel axis theorem) by injecting $\mathbf{\hat{x}}_{k,i}$ into the squared distance: $\|\mathbf{x}_{k,j}-\mathbf{z}_{i}\|^{2}=\|(\mathbf{x}_{k,j}-\mathbf{\hat{x}}_{k,i})+(\mathbf{\hat{x}}_{k,i}-\mathbf{z}_{i})\|^{2}$. Expanding this quadratic form and taking the weighted sum over $j$ yields:
$$\sum_{j=1}^{M_{k}}P_{k,ij}^{(r-1)}\|\mathbf{x}_{k,j}-\mathbf{z}_{i}\|^{2}=w_{i}\|\mathbf{z}_{i}-\mathbf{\hat{x}}_{k,i}\|^{2}+\sum_{j=1}^{M_{k}}P_{k,ij}^{(r-1)}\|\mathbf{x}_{k,j}-\mathbf{\hat{x}}_{k,i}\|^{2}+2\Big\langle\mathbf{z}_{i}-\mathbf{\hat{x}}_{k,i},\sum_{j}P_{k,ij}^{(r-1)}(\mathbf{\hat{x}}_{k,i}-\mathbf{x}_{k,j})\Big\rangle. \tag{186}$$
By the definition of the barycenter $\mathbf{\hat{x}}_{k,i}$ and the marginal constraint $\sum_{j}P_{k,ij}^{(r-1)}=w_{i}$, the inner summation equals $w_{i}\mathbf{\hat{x}}_{k,i}-\sum_{j}P_{k,ij}^{(r-1)}\mathbf{x}_{k,j}=\mathbf{0}$, so the cross-term vanishes exactly.
Additionally, because the optimal transport plan $\mathbf{P}_{k}^{(r-1)}$ and the target data $\mathbf{x}_{k,j}$ are fixed from the previous iteration, the variance term $\sum_{j=1}^{M_{k}}P_{k,ij}^{(r-1)}\|\mathbf{x}_{k,j}-\mathbf{\hat{x}}_{k,i}\|^{2}$ is a constant independent of the optimization variable $\mathbf{z}_{i}$. Minimizing the surrogate therefore reduces to minimizing the remaining term, the weighted squared distances to the barycentric projections:
$$\mathop{\arg\min}_{\{\mathbf{z}_{i}\}}\mathcal{U}(\xi,\xi^{(r-1)})=\mathop{\arg\min}_{\{\mathbf{z}_{i}\}}\sum_{i=1}^{S}w_{i}\sum_{k=1}^{K}\lambda_{k}\|\mathbf{z}_{i}-\mathbf{\hat{x}}_{k,i}\|^{2}. \tag{187}$$
The unique global minimum of this quadratic objective, denoted by $\mathbf{z}_{i}^{*}$, is precisely the convex combination of the projections:
$$\mathbf{z}_{i}^{*}=\sum_{k=1}^{K}\lambda_{k}\mathbf{\hat{x}}_{k,i}. \tag{188}$$
In the algorithm, the local interpolation step computes the intermediate client vector $\mathbf{v}_{k,i}=(1-t)\mathbf{y}_{i}+t\mathbf{\hat{x}}_{k,i}$. The subsequent global update aggregates them to form the new proxy points $\mathbf{y}_{i}^{+}$:
$$\mathbf{y}_{i}^{+}=\sum_{k=1}^{K}\lambda_{k}\mathbf{v}_{k,i}=(1-t)\mathbf{y}_{i}+t\mathbf{z}_{i}^{*}. \tag{189}$$
For $t=1$, the update reaches the exact global minimizer $\mathbf{z}_{i}^{*}$. For any $t\in(0,1]$, the update constitutes a valid descent step along the gradient of the strictly convex surrogate. Consequently, the surrogate value monotonically decreases:
$$\mathcal{U}(\xi^{(r)},\xi^{(r-1)})\leq\mathcal{U}(\xi^{(r-1)},\xi^{(r-1)}). \tag{190}$$
4. Monotonicity and Convergence. Chaining the majorization property ([Eq. 184](https://arxiv.org/html/2606.13732#A2.E184)), the minimization descent ([Eq. 190](https://arxiv.org/html/2606.13732#A2.E190)), and the tight bound at the previous step, we establish the descent property:
$$\mathcal{E}(\xi^{(r)})\leq\mathcal{U}(\xi^{(r)},\xi^{(r-1)})\leq\mathcal{U}(\xi^{(r-1)},\xi^{(r-1)})=\mathcal{E}(\xi^{(r-1)}). \tag{191}$$
Since the sequence $\{\mathcal{E}(\xi^{(r)})\}_{r\geq 0}$ is non-increasing and bounded below by 0, it converges to its infimum $\mathcal{E}(\xi^{*})$ by the Monotone Convergence Theorem. Due to the displacement convexity of the objective in $\mathcal{W}_{2}$, the limit $\xi^{*}$ corresponds to the unique stationary Wasserstein barycenter. ∎
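The MM update can be sketched in a setting where the optimal transport plans are available in closed form. The code below assumes one-dimensional data, uniform weights, and equal support sizes ($S=M_{k}$), so that the optimal plan simply matches points by rank. The function names, client samples, aggregation weights, and step size `t=0.5` are all hypothetical and chosen only for illustration; a general implementation would need an optimal transport solver in place of the sorting step.

```python
def w2_squared(xs, ys):
    """Squared W2 between equal-size 1D uniform empirical measures."""
    return sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def frechet_objective(proxy, clients, lams):
    """Weighted Frechet variance: sum_k lambda_k * W2^2(Q_k, proxy)."""
    return sum(lam * w2_squared(q, proxy) for lam, q in zip(lams, clients))

def mm_step(proxy, clients, lams, t=0.5):
    """One majorize-minimize update of the proxy support points."""
    order = sorted(range(len(proxy)), key=lambda i: proxy[i])  # rank of each y_i
    new_proxy = list(proxy)
    for rank, i in enumerate(order):
        # Barycentric projection: with uniform weights the optimal 1D plan
        # sends the rank-th proxy point to the rank-th client point.
        z_star = sum(lam * sorted(q)[rank] for lam, q in zip(lams, clients))
        # Damped update y_i^+ = (1 - t) * y_i + t * z_i^*.
        new_proxy[i] = (1 - t) * proxy[i] + t * z_star
    return new_proxy

clients = [[0.0, 1.0, 2.0, 3.0], [10.0, 12.0, 14.0, 16.0], [-5.0, -4.0, 4.0, 5.0]]
lams = [0.5, 0.3, 0.2]
proxy = [7.0, -3.0, 0.5, 2.0]

history = [frechet_objective(proxy, clients, lams)]
for _ in range(60):
    proxy = mm_step(proxy, clients, lams)
    history.append(frechet_objective(proxy, clients, lams))

print(history[0], history[-1])
```

In this one-dimensional case the iterates approach the rank-wise weighted average of the client samples, and the recorded objective values are non-increasing, matching the descent chain in the proof.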
## Appendix C Additional Experiments and Details
### C.1 Detailed Wasserstein Ground Metrics
In this section, we provide the explicit formulations of the ground metrics used in our main theoretical framework. Recall that $\mathcal{P}_{p}(\Omega)$ denotes the space of probability measures with finite $p$-th moments on a complete separable metric space $\Omega$. The $p$-Wasserstein distance between any two distributions $\mu,\nu\in\mathcal{P}_{p}(\Omega)$ is formally defined as:
$$\mathbb{W}_{p}(\mu,\nu)=\Big(\inf_{\pi\in\Pi(\mu,\nu)}\mathbb{E}_{(\mathbf{z},\mathbf{z}^{\prime})\sim\pi}[d^{p}(\mathbf{z},\mathbf{z}^{\prime})]\Big)^{1/p}, \tag{192}$$
where $\Pi(\mu,\nu)$ is the set of joint distributions (couplings) with marginals $\mu$ and $\nu$. The geometry of this transport space is entirely dictated by the choice of the ground metric $d:\Omega\times\Omega\to\mathbb{R}_{\geq 0}$ (Peyré et al., [2019](https://arxiv.org/html/2606.13732#bib.bib28)).
The primal formulation in [Eq. 192](https://arxiv.org/html/2606.13732#A3.E192) is often computationally intractable and theoretically obscure due to the constraints on the coupling $\pi$. By the Kantorovich duality theorem (Villani and others, [2008](https://arxiv.org/html/2606.13732#bib.bib34)), we can express the $p$-th power of the Wasserstein distance through its dual form:
$$\mathbb{W}_{p}^{p}(\mu,\nu)=\sup_{(\phi,\psi)\in\Phi_{c}}\Big(\mathbb{E}_{\mathbf{z}\sim\mu}[\phi(\mathbf{z})]+\mathbb{E}_{\mathbf{z}^{\prime}\sim\nu}[\psi(\mathbf{z}^{\prime})]\Big), \tag{193}$$
where $\Phi_{c}=\{(\phi,\psi)\in C_{b}(\Omega)^{2}:\phi(\mathbf{z})+\psi(\mathbf{z}^{\prime})\leq d^{p}(\mathbf{z},\mathbf{z}^{\prime})\ \text{for all }\mathbf{z},\mathbf{z}^{\prime}\in\Omega\}$ denotes the set of admissible Kantorovich potentials. The left-hand side carries the power $p$ because the dual problem matches the infimum inside [Eq. 192](https://arxiv.org/html/2606.13732#A3.E192), before the $1/p$ root is taken. This dual perspective is particularly instrumental for our theoretical analysis, as it transforms the minimization over couplings into a maximization over scalar functions, relating the transport cost to the geometry induced by $d^{p}$.
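Weak duality, the easy half of the Kantorovich theorem, can be checked directly: any admissible pair of potentials gives a lower bound on the transport cost. The sketch below assumes $p=1$ on the real line with the absolute-value ground metric; the clipped identity `clipped` is a hypothetical bounded 1-Lipschitz potential, and the sample values are made up.

```python
def w1_empirical(xs, ys):
    """W1 between equal-size 1D empirical measures via the sorted (monotone) coupling."""
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def clipped(z, c=10.0):
    """A bounded 1-Lipschitz potential phi(z) = clip(z, -c, c)."""
    return max(-c, min(c, z))

mu = [0.0, 1.0, 2.0, 7.0]
nu = [1.5, 2.5, 3.0, 4.0]

# Feasible pair for p = 1: phi = clipped and psi = -clipped, because
# phi(z) + psi(z') = clipped(z) - clipped(z') <= |z - z'|.
dual_value = sum(clipped(z) for z in mu) / len(mu) - sum(clipped(z) for z in nu) / len(nu)
primal_value = w1_empirical(mu, nu)
print(dual_value, primal_value)
```

This particular potential is far from optimal, so the dual value sits well below the primal cost; the supremum over all admissible pairs closes the gap.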
Adopting the ground metric construction methodologies established in recent studies (Li et al., [2024](https://arxiv.org/html/2606.13732#bib.bib31); Li and Pang, [2024](https://arxiv.org/html/2606.13732#bib.bib30); Rakotomamonjy et al., [2024](https://arxiv.org/html/2606.13732#bib.bib29)), we formulate modality-specific metrics to leverage the inherent structural properties of visual and textual data.
##### Vision Modality \(Linearized Structural Distance\)\.
Directly computing the exact Wasserstein distance between Gaussian distributions involves computationally expensive matrix square roots. To facilitate efficient computation and align with the vectorized representation of datasets (Rakotomamonjy et al., [2024](https://arxiv.org/html/2606.13732#bib.bib29); Alvarez-Melis and Fusi, [2020](https://arxiv.org/html/2606.13732#bib.bib24)), we adopt a parameter-space approximation.
Let $\mathbf{z} = (\mathbf{x}, y)$ be a sample, where $y$ parameterizes a class-conditional Gaussian $\mathcal{N}(\bm{\mu}_y, \bm{\Sigma}_y)$. We define the ground metric $d_{\text{img}}$ by embedding the Gaussian parameters into a Euclidean space:

$$d_{\text{img}}\big((\mathbf{x},y),(\mathbf{x}',y')\big) = \left(\|\mathbf{x}-\mathbf{x}'\|_2^2 + \|\bm{\mu}_y-\bm{\mu}_{y'}\|_2^2 + \|\bm{\Sigma}_y-\bm{\Sigma}_{y'}\|_F^2\right)^{\frac{1}{2}}. \tag{194}$$

This formulation corresponds to measuring the Euclidean distance between augmented state vectors $\tilde{\mathbf{x}} \coloneqq [\mathbf{x}; \bm{\mu}_y; \text{vec}(\bm{\Sigma}_y)]$, effectively linearizing the transport cost for high-dimensional efficiency.
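The augmented-vector view of Eq. 194 can be sketched in a few lines of NumPy. This is a minimal illustration under the assumption that the class-conditional means and covariances have already been estimated; the helper names `augmented_vector` and `d_img` are hypothetical.

```python
import numpy as np


def augmented_vector(x, mu_y, sigma_y):
    """Stack [x; mu_y; vec(Sigma_y)] into one flat vector."""
    return np.concatenate([np.ravel(x), np.ravel(mu_y), np.ravel(sigma_y)])


def d_img(x, mu_y, sigma_y, x2, mu_y2, sigma_y2):
    """Euclidean distance between augmented vectors, i.e. the ground metric of Eq. 194."""
    a = augmented_vector(x, mu_y, sigma_y)
    b = augmented_vector(x2, mu_y2, sigma_y2)
    return float(np.linalg.norm(a - b))


# Toy example with 2-D features and hypothetical class statistics.
x_a, mu_a, sigma_a = np.zeros(2), np.zeros(2), np.eye(2)
x_b, mu_b, sigma_b = np.array([3.0, 4.0]), np.array([1.0, 0.0]), 2.0 * np.eye(2)
dist = d_img(x_a, mu_a, sigma_a, x_b, mu_b, sigma_b)
print(dist)
```

Flattening the covariance with `np.ravel` makes the Euclidean norm of the difference coincide with the Frobenius norm term in Eq. 194.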
##### Language Modality \(Semantic Transport Distance\)\.
Euclidean distances between sparse document vectors are often insufficient for capturing semantic proximity. To address this, we employ the Word Mover's Distance (WMD) (Kusner et al., [2015](https://arxiv.org/html/2606.13732#bib.bib26)), modeling each document as a discrete distribution in the word embedding space.
In this setting, a sample $\mathbf{z} \in \Omega$ corresponds to a normalized Bag-of-Words (nBOW) vector $\mathbf{d} \in \Delta^{V-1}$, where $V$ is the vocabulary size and $\Delta^{V-1}$ is the probability simplex. We define the cost between documents $\mathbf{d}_i$ and $\mathbf{d}_j$ via an inner optimal transport problem:

$$d_{\text{text}}(\mathbf{d}_i,\mathbf{d}_j) = \inf_{T \in \Pi(\mathbf{d}_i,\mathbf{d}_j)} \sum_{u,v} T_{uv}\,\|\mathbf{w}_u-\mathbf{w}_v\|_2, \tag{195}$$

where $\mathbf{w}_u \in \mathbb{R}^k$ is the embedding for word $u$ (e.g., Word2Vec (Mikolov et al., [2013](https://arxiv.org/html/2606.13732#bib.bib27))) and $T_{uv}$ denotes the transport plan between words. Effectively, this metric quantifies the minimum cumulative cost to semantically transform $\mathbf{d}_i$ into $\mathbf{d}_j$, providing a geometry-aware ground metric for the subsequent dataset-level Wasserstein distance.
### C.2 Detailed Implementation
#### C.2.1 Detailed Implementation of Baselines
To evaluate the effectiveness of our proposed method, we compare it against a diverse set of selection strategies\. These baselines are image\-based methods that primarily operate in the continuous feature space\.
##### Image Selection Methods\.
We utilize feature embeddings extracted from pre\-trained networks \(e\.g\., Inception\-V3\)\.
- CovMatch (Covariance Matching). Following Rezaei et al. ([2026](https://arxiv.org/html/2606.13732#bib.bib6)), this method aims to preserve the spectral properties of the training data. It employs a greedy iterative algorithm to minimize the Frobenius norm difference between the empirical covariance matrix of the selected synthetic subset and that of the real training data. This explicitly aligns second-order statistics to mitigate spectral shrinkage.
- CenterMatch (Centroid Matching). As described in He et al. ([2023](https://arxiv.org/html/2606.13732#bib.bib61)), this strategy assumes that high-quality samples reside close to the mode of the distribution. It computes the global centroid of real training features and selects the generated samples with the smallest Euclidean distances to this centroid. While this prioritizes high-density regions, it may reduce diversity by ignoring the distribution's tails.
- K-means (Cluster-based Sampling). Utilized in Lin et al. ([2023](https://arxiv.org/html/2606.13732#bib.bib62)), this baseline addresses the diversity limitations of centroid-based methods. It partitions the generated pool into $n$ clusters (where $n$ is the budget size) using K-means and selects the sample closest to each cluster center. This spreads the selected samples across the synthetic manifold rather than concentrating them near a single mode.
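The first two baselines can be sketched as follows. This is a minimal illustration of the descriptions above, not the cited authors' implementations: the function names are hypothetical, the greedy CovMatch loop is a naive variant that recomputes the covariance for every candidate, and K-means is omitted because it needs a clustering routine.

```python
import numpy as np


def _cov(feats):
    """Biased empirical covariance; stays defined for a single sample."""
    centered = feats - feats.mean(axis=0)
    return centered.T @ centered / len(feats)


def center_match(real_feats, gen_feats, budget):
    """Pick the generated samples closest to the centroid of the real features."""
    centroid = real_feats.mean(axis=0)
    dists = np.linalg.norm(gen_feats - centroid, axis=1)
    return np.argsort(dists, kind="stable")[:budget]


def cov_match_greedy(real_feats, gen_feats, budget):
    """Greedily add the sample that best matches the real covariance (Frobenius norm)."""
    target = _cov(real_feats)
    selected, remaining = [], list(range(len(gen_feats)))
    for _ in range(min(budget, len(gen_feats))):
        best_idx, best_err = None, np.inf
        for idx in remaining:
            err = np.linalg.norm(_cov(gen_feats[selected + [idx]]) - target, ord="fro")
            if err < best_err:
                best_idx, best_err = idx, err
        selected.append(best_idx)
        remaining.remove(best_idx)
    return np.array(selected)


real = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, 1.0]])
gen = np.array([[5.0, 5.0], [0.1, 0.0], [1.0, 1.0], [-0.2, 0.0], [0.0, -1.0]])
print(center_match(real, gen, budget=2))
print(cov_match_greedy(real, gen, budget=3))
```

The contrast is visible even on this toy input: `center_match` keeps only points near the origin, while the covariance objective rewards subsets whose spread resembles that of the real features.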
#### C.2.2 Detailed Implementation of Metrics
To provide a comprehensive assessment of the generated data quality, we employ a combination of metrics measuring both distributional fidelity and sample diversity\.
##### Image Generation Metrics\.
We utilize the Inception\-V3 feature space to compute the following metrics, ensuring a standardized comparison with prior works\.
- Fréchet Inception Distance (FID). This metric computes the Fréchet distance between two multivariate Gaussian approximations fitted to the Inception-V3 feature embeddings of real and generated images. Concretely, let $(\mu_r, \bm{\Sigma}_r)$ and $(\mu_g, \bm{\Sigma}_g)$ be the empirical means and covariance matrices of the real feature distribution $\mathcal{D}_r$ and the generated feature distribution $\mathcal{D}_g$, respectively. The FID is defined as:

  $$\text{FID}(\mathcal{D}_r,\mathcal{D}_g) = \|\mu_r-\mu_g\|_2^2 + \text{Tr}\left(\bm{\Sigma}_r+\bm{\Sigma}_g-2(\bm{\Sigma}_r\bm{\Sigma}_g)^{1/2}\right). \tag{196}$$

  A lower FID indicates closer alignment of the generated distribution to the real distribution, reflecting higher visual quality and better mode coverage.
- Precision and Recall. To disentangle the evaluation of quality and diversity, we adopt the improved precision and recall metrics (Kynkäänniemi et al., [2019](https://arxiv.org/html/2606.13732#bib.bib63)).
  - Implementation Details: We compute these metrics in the Inception-V3 feature space using $k$-nearest neighbor ($k$-NN) manifold estimation with $k=5$. The estimation is performed using the generated sample pool (e.g., 10,000 samples per iteration) against the full training set to ensure statistical consistency.
  - Precision (Fidelity): This metric measures the fraction of generated images that lie within the estimated support (manifold) of the real data distribution. It quantifies how "realistic" the generated samples are.
  - Recall (Diversity): This metric measures the fraction of real images that lie within the estimated support of the generated data distribution. A significant drop in Recall is a strong indicator of mode collapse, implying the generator has failed to capture the full variation of the training data.
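Both metrics can be sketched directly from their definitions once features are available. The code below is an illustrative NumPy version that assumes the Inception-V3 embeddings are already extracted into arrays; the function names are hypothetical, and the dense pairwise-distance matrix is only practical for small sample counts. The trace of the matrix square root is computed from the eigenvalues of $\bm{\Sigma}_r\bm{\Sigma}_g$, which are real and non-negative for positive semi-definite inputs.

```python
import numpy as np


def fid_from_features(real, gen):
    """FID (Eq. 196) from two (num_samples, dim) feature arrays, dim >= 2."""
    mu_r, mu_g = real.mean(axis=0), gen.mean(axis=0)
    cov_r, cov_g = np.cov(real, rowvar=False), np.cov(gen, rowvar=False)
    # Tr((cov_r cov_g)^{1/2}) = sum of square roots of the eigenvalues of cov_r @ cov_g.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


def _pairwise_dist(a, b):
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def knn_radii(feats, k):
    """Distance from each point to its k-th nearest neighbour (self excluded)."""
    dists = np.sort(_pairwise_dist(feats, feats), axis=1)
    return dists[:, k]  # column 0 is the zero self-distance


def coverage(queries, support, k):
    """Fraction of queries inside at least one k-NN ball around a support point."""
    radii = knn_radii(support, k)
    inside = _pairwise_dist(queries, support) <= radii[None, :]
    return float(inside.any(axis=1).mean())


def precision_recall(real, gen, k=5):
    precision = coverage(gen, real, k)  # generated samples on the real manifold
    recall = coverage(real, gen, k)     # real samples on the generated manifold
    return precision, recall


rng = np.random.default_rng(0)
real_feats = rng.normal(size=(200, 4))
narrow_gen = 0.1 * rng.normal(size=(200, 4))  # collapsed generator: low spread
print(fid_from_features(real_feats, narrow_gen))
print(precision_recall(real_feats, narrow_gen, k=5))
```

A collapsed generator such as `narrow_gen` is the case the Recall metric is meant to expose: its samples may still sit on the real manifold while covering little of it.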
#### C.2.3 Detailed Implementation of Configurations
We detail the model architectures, datasets, and training protocols used for both image and text generation tasks\. All experiments were conducted following standard configurations established in prior literature to ensure fair comparability\.
##### Image Generation Framework\.
For image synthesis, we adopt the standard Denoising Diffusion Probabilistic Models (DDPM) framework (Ho et al., [2020](https://arxiv.org/html/2606.13732#bib.bib71)). Following the configuration in Shi et al. ([2025](https://arxiv.org/html/2606.13732#bib.bib5)), we train the model using a standard U-Net architecture with attention mechanisms at multiple resolution levels.
- Datasets and Preprocessing. In our experiments, we employ three benchmark datasets representing different levels of complexity. For the initial training phase (Generation $t=0$), we utilize the 50,000 training samples of each training dataset to establish a strong generator.
  - CIFAR-10: This dataset comprises 60,000 diverse natural images spanning 10 distinct semantic classes (e.g., vehicles and animals). It is officially partitioned into a training set of 50,000 images and a test set of 10,000 images. We utilize the complete training set with a native resolution of $32\times 32$. It serves as a fundamental benchmark to evaluate the model's capability in capturing global semantic structures in a low-resolution setting.
  - CelebA: This large-scale face-attribute dataset contains 202,599 images. We follow the standard protocol for the official split (162,770 training, 19,867 validation, and 19,962 testing). To ensure a consistent data scale across experiments, we randomly sample a subset of 50,000 images from the training split. All images are preprocessed to $32\times 32$ via center-cropping. We map the hair color attributes into five distinct categories: 0: Black Hair, 1: Blond Hair, 2: Brown Hair, 3: Gray Hair, and 4: Other (including Bald or unknown).
  - STL-10: To evaluate performance on more diverse object structures, we utilize the STL-10 dataset, which consists of 5,000 labeled training images, 8,000 test images, and 100,000 unlabeled images. We randomly sample 50,000 images from the combined pool of labeled and unlabeled data to construct the training set. While the original resolution is $96\times 96$, we resize the images to $32\times 32$ to align with our model configuration. This dataset challenges the model to capture fine-grained details within a reduced resolution.
- Training and Sampling Configuration: We adopt the Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., [2020](https://arxiv.org/html/2606.13732#bib.bib71)) framework utilizing a standard U-Net (Ronneberger et al., [2015](https://arxiv.org/html/2606.13732#bib.bib89)) architecture. The diffusion process is trained over $T=1000$ timesteps employing a linear variance schedule. For efficient inference, we use the Denoising Diffusion Implicit Models (DDIM) (Song et al., [2021](https://arxiv.org/html/2606.13732#bib.bib73)) algorithm with 50 sampling steps. In each generation cycle, we synthesize a candidate pool of $N=4n$ samples (where $n=50,000$) and apply our selection mechanism to filter the top-$n$ samples for the subsequent training iteration.
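The numeric side of this configuration can be written down as a short sketch. The schedule endpoints (`1e-4` to `0.02`) are the defaults from Ho et al. (2020) and are an assumption here, since the text only states that the schedule is linear; the evenly spaced DDIM sub-sequence is likewise one common choice rather than a confirmed detail of this paper.

```python
import numpy as np

T = 1000                      # training timesteps, as stated in the text
beta_start, beta_end = 1e-4, 0.02  # ASSUMED endpoints (Ho et al., 2020 defaults)

# Linear variance schedule and the cumulative signal-retention products.
betas = np.linspace(beta_start, beta_end, T)
alphas_bar = np.cumprod(1.0 - betas)

# DDIM samples along a sub-sequence of the training timesteps.
ddim_steps = 50
timesteps = np.arange(0, T, T // ddim_steps)[::-1]  # ASSUMED uniform spacing

# Candidate pool and selection budget per generation cycle.
n = 50_000
pool_size = 4 * n

print(len(timesteps), timesteps[:3], pool_size)
```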
In all experimental settings, we enforce a consistent selection protocol to ensure a fair comparison. For a target training budget of $n$ samples per category, we initially generate a larger candidate pool consisting of $N=4n$ synthetic images. We then apply our selection mechanism (as described in the methodology) to filter this candidate pool, retaining exactly $n$ representative samples while discarding the remaining $3n$ instances. Based on these selected samples, we evaluate three distinct strategies for constructing the final training set:
- Replace: This strategy prioritizes current data distributions. We discard all historical data and construct the training set using exclusively the $n$ newly selected samples from the current generation step. Consequently, the training set size is fixed at $n$.
- Accumulate: This strategy maximizes data volume. We append the $n$ newly selected samples to the existing historical dataset. As a result, the training set size increases by $n$ at each update step, utilizing the full history of selected synthetic data.
- Accumulate-Subsample: This strategy balances diversity and computational cost. We first combine the $n$ newly selected samples with the accumulated historical data. From this unified pool, we randomly draw a fixed-size subset of $n$ samples. This ensures the training set size remains constant at $n$ while maintaining a diverse mixture of current and historical distributions.
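The three strategies differ only in how the newly selected samples are combined with history, which the following sketch makes explicit. The function name `build_training_set` and the string labels are illustrative, and `history` is assumed to hold every sample selected in earlier generations.

```python
import random


def build_training_set(strategy, history, selected, n, rng=None):
    """Return the next training set from `history` and the `n` newly `selected` samples."""
    if strategy == "replace":
        # Discard history entirely; size stays at n.
        return list(selected)
    pooled = list(history) + list(selected)
    if strategy == "accumulate":
        # Keep everything; size grows by n per generation.
        return pooled
    if strategy == "accumulate_subsample":
        # Uniformly subsample the pooled data back down to n.
        rng = rng or random.Random()
        return rng.sample(pooled, n)
    raise ValueError(f"unknown strategy: {strategy}")


n = 4
rng = random.Random(0)
history = []
sizes = {"replace": [], "accumulate": [], "accumulate_subsample": []}
for t in range(3):
    selected = [f"gen{t}_sample{i}" for i in range(n)]
    for name in sizes:
        sizes[name].append(len(build_training_set(name, history, selected, n, rng)))
    history = history + selected  # the accumulated pool grows in every case
print(sizes)
```

Only the training set handed to the generator changes across strategies; the accumulated pool itself is tracked separately so that Accumulate-Subsample can draw from the full history.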
### C.3 Hardware
All models were trained and evaluated on a dedicated server running Ubuntu 20.04.2 LTS. The system is equipped with dual Intel® Xeon® Gold 6442Y CPUs (48 physical cores, 96 threads @ 2.60 GHz), approximately 503 GiB of system memory, and a GPU cluster consisting of 8 NVIDIA L40 GPUs (48 GB VRAM each). Detailed code and generation results for each round are available at [GitHub](https://github.com/XinbaoQiao/When-Sample-Selection-Bias-Precipitates-Model-Collapse).
### C.4 Multivariate Gaussian Modeling and Non-Gaussian Empirical Evidence
Figure 7: Impact of Sample Size on Variance Collapse. Panels: (a) $n=100$, (b) $n=300$, (c) $n=500$. We visualize the evolution of the variance ratio $\text{Tr}(\bm{\Sigma}_t)/\text{Tr}(\bm{\Sigma}_0)$ over generations under varying generated sample sizes ($n$) in each iteration. While increasing the sample size reduces the collapse speed for the naive Replace baseline (Blue), the Replace paradigm with biased selection (Black) consistently exhibits rapid variance decay governed by the power-law dynamics described in [Theorem 2](https://arxiv.org/html/2606.13732#Thmtheorem2), showing that selection bias dominates finite-sample effects.
##### Multivariate Gaussian Modeling
While [Section 3](https://arxiv.org/html/2606.13732#S3) focuses on the representative regime of $n=300$ to illustrate the core collapse dynamics, we provide here a comprehensive specification of the experimental environment and extend our analysis to varying sample sizes ($n \in \{100, 500\}$) to demonstrate the universality of the observed phenomena.
##### Experimental Settings and Hyperparameters\.
To empirically validate our theoretical findings, we perform controlled simulations on a synthetic multivariate Gaussian benchmark. The data dimension is fixed at $d=10$, and the iterative training is tracked over $T=300$ generations. We initialize the ground truth distribution as $\mathcal{N}(\bm{\mu}^*, \bm{\Sigma}^*)$ with randomly sampled parameters. For the selection mechanism, we design a biased utility function $U(\mathbf{x}) = -\|\mathbf{x}-\mathbf{u}^*\|^2$ that favors a local mode $\mathbf{u}^*$ distinct from the true mean $\bm{\mu}^*$. We enforce a strict selection pressure by setting the retention ratio to $\alpha=0.05$, meaning only the top 5% of generated samples with the highest utility scores are used for training the next generation. We vary the generated sample size $n \in \{100, 300, 500\}$ in each iteration to study finite-sample effects.
##### Results: The Dominance of Selection Bias\.
[Figure 7](https://arxiv.org/html/2606.13732#A3.F7) illustrates the evolution of the variance ratio $\text{Tr}(\bm{\Sigma}_t)/\text{Tr}(\bm{\Sigma}_0)$ across different sample sizes. The comparative results highlight a buffering effect of accumulation and two fundamentally different collapse regimes, one $n$-dependent and one $n$-independent:
- Buffering Effect against Selection: When selection is applied to accumulated data (Accumulate & Selection), the variance decay is markedly slower than in the Replace & Selection case. Historical data act as a "memory anchor," diluting the concentration effect of biased synthetic samples. However, the downward trend persists, indicating that while accumulation delays collapse, it cannot fully neutralize the long-term distributional drift induced by a fixed utility function.
- Mitigation in Naive Replacement ($n$-dependent): In the absence of selection (the Replace baseline), the collapse rate is highly sensitive to the sample size. As observed in [Figure 7(a)](https://arxiv.org/html/2606.13732#A3.F7.sf1) ($n=100$) versus [Figure 7(c)](https://arxiv.org/html/2606.13732#A3.F7.sf3) ($n=500$), increasing the sample size significantly delays the variance decay. This aligns with standard statistical theory: larger $n$ reduces the variance of the sample covariance estimator (scaling roughly with $O(1/n)$), thereby preserving the distributional width for longer periods.
- Invariance in Selection Dynamics ($n$-independent): Conversely, Replace & Selection and Accumulate & Selection demonstrate a striking insensitivity to sample size. Whether $n=100$ or $n=500$, the variance under each paradigm follows a nearly identical trajectory. This observation is crucial: it indicates that the selection bias acts as the dominant force, overshadowing the benefits of increased data volume. The selection mechanism effectively truncates the distribution's tails at each step, forcing a contraction rate that adheres to the power-law dynamics predicted in [Theorem 2](https://arxiv.org/html/2606.13732#Thmtheorem2). Consequently, simply scaling up the synthetic data generation (increasing $n$) fails to counteract the structural collapse induced by biased selection.
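The contrast between naive replacement and replacement with biased selection can be reproduced qualitatively with a short simulation. This is a sketch under stated assumptions, not the authors' experiment code: the way $\bm{\mu}^*$, $\bm{\Sigma}^*$, and the local mode $\mathbf{u}^*$ are drawn is a guess, only 20 generations are run, and each generation refits a Gaussian by its sample mean and covariance.

```python
import numpy as np


def sample_gaussian(rng, mean, cov, size):
    """Sample via eigen-decomposition, which tolerates near-singular covariances."""
    w, v = np.linalg.eigh(cov)
    scale = np.sqrt(np.clip(w, 0.0, None))
    z = rng.standard_normal((size, len(mean)))
    return mean + (z * scale) @ v.T


def variance_ratio_trajectory(n, generations, alpha=0.05, select=True, d=10, seed=0):
    """Track Tr(Sigma_t) / Tr(Sigma_0) under the Replace paradigm."""
    rng = np.random.default_rng(seed)
    mu_star = rng.standard_normal(d)                # ASSUMED initialisation
    a = rng.standard_normal((d, d))
    sigma_star = a @ a.T / d + np.eye(d)            # ASSUMED random SPD covariance
    u_star = mu_star + rng.standard_normal(d)       # ASSUMED local mode, offset from mu*
    mean, cov = mu_star, sigma_star
    ratios = []
    for _ in range(generations):
        x = sample_gaussian(rng, mean, cov, n)
        if select:
            utility = -np.sum((x - u_star) ** 2, axis=1)   # U(x) = -||x - u*||^2
            keep = max(2, round(alpha * n))
            x = x[np.argsort(utility)[-keep:]]              # top-alpha by utility
        mean, cov = x.mean(axis=0), np.cov(x, rowvar=False)  # refit the next generator
        ratios.append(float(np.trace(cov) / np.trace(sigma_star)))
    return ratios


with_selection = variance_ratio_trajectory(n=300, generations=20, select=True)
without_selection = variance_ratio_trajectory(n=300, generations=20, select=False)
print(with_selection[-1], without_selection[-1])
```

Changing `n` in the two calls gives a quick check of the sample-size sensitivity discussed above, although the exact decay rates depend on the assumed initialization.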

Figure 8: Recursive selection dynamics beyond the isotropic Gaussian model, reported for $n=300$. Panels: (a) Anisotropic Gaussian, (b) Balanced mixture, (c) Imbalanced mixture, (d) Laplace. Across structured, multimodal, imbalanced, and heavy-tailed distributions, local-reference selection sharply reduces normalized dispersion.
##### Non\-Gaussian recursive selection dynamics\.
The theoretical results in [Section 3](https://arxiv.org/html/2606.13732#S3) are derived under a Gaussian selection model, which provides a tractable setting for analyzing how local-reference filtering affects the mean, covariance, and long-term diversity of recursively generated data. This assumption is useful for obtaining explicit collapse dynamics and the associated power-law decay, but it also raises a natural question: whether the same qualitative mechanism persists when the data distribution departs from the Gaussian structure used in the analysis. We therefore include additional empirical checks that explore the selection process along several axes not covered by the main theory, including anisotropy, multimodality, component imbalance, heavy-tailed variation, semantic verifier bias, and weaker generative initialization.
These experiments should be interpreted as empirical robustness evidence rather than as formal extensions of [Theorems 1](https://arxiv.org/html/2606.13732#Thmtheorem1) and [2](https://arxiv.org/html/2606.13732#Thmtheorem2). The goal is not to claim that the Gaussian power-law rate holds unchanged in all these settings. Instead, we examine whether the mechanism identified in the main text remains observable: when the verifier relies on a restricted local reference, recursive selection tends to preserve samples aligned with that reference while suppressing regions that are underrepresented, distant, or semantically outside the verifier's domain. Across the following settings, the results consistently support this qualitative interpretation.
We first consider synthetic distributions that isolate different forms of distributional structure beyond the isotropic Gaussian case. The anisotropic Gaussian control tests whether collapse persists when the distribution has direction-dependent variance. The balanced mixture setting introduces multimodality, allowing us to examine whether selection preserves multiple separated components or concentrates mass around a subset of them. The imbalanced mixture further introduces minority modes, which are particularly relevant to our claim that biased selection can remove weakly represented regions of the distribution. Finally, the Laplace family introduces heavier tails, testing whether local-reference selection suppresses tail scale rather than only collapsing Gaussian-like variance. For each distribution family, we track the normalized dispersion

$$\operatorname{DispRatio}(P_t) = \frac{\operatorname{Disp}(P_t)}{\operatorname{Disp}(P_0)},$$

where smaller values indicate stronger loss of distributional diversity relative to the initial distribution. This metric provides a common scale across distribution families and sample sizes, allowing us to compare how recursive accumulation, replacement, and selection affect the preservation of distributional spread. In addition to this aggregate dispersion measure, we report family-specific diagnostics for the imbalanced mixture and Laplace settings, since these two cases expose more structured forms of collapse: erosion of minority components and contraction of heavy-tailed variation.
Table 2: Final dispersion ratios at iteration $t=200$. Lower values indicate stronger diversity collapse. The omitted Replace+Select column equals $0.0000$ for every distribution family and sample size, and is therefore reported in the caption rather than as a degenerate all-zero column.
Table 3: Minority-mode diagnostics for the imbalanced mixture at iteration $t=200$. Lower smallest-component weight and lower entropy indicate stronger erosion of low-probability modes.
Table 4: Tail-scale diagnostics for the Laplace family at iteration $t=200$. Lower mean absolute scale indicates stronger contraction of heavy-tailed variation.
The trajectories in [Figure 8](https://arxiv.org/html/2606.13732#A3.F8) reveal several consistent patterns. First, accumulation without selection remains close to the initial dispersion across all four distribution families, indicating that recursive accumulation alone does not necessarily induce immediate diversity loss in these controlled settings. Second, local-reference selection introduces a strong dissipative effect. In the anisotropic Gaussian, balanced mixture, and imbalanced mixture settings, Accumulate+Select decreases monotonically or near-monotonically from unit dispersion to a small residual level, whereas Replace+Select collapses almost immediately to numerical zero. This sharp contrast indicates that the diversity loss is not caused solely by recursive resampling, but by the interaction between recursion and selective filtering toward a restricted reference.
The behavior of the Replace baseline further clarifies the role of selection. Without selection, replacement dynamics can be noisy and distribution-dependent: in the mixture settings, dispersion decreases gradually but remains substantially above zero, while in the Laplace setting it can even expand beyond the initial dispersion due to heavy-tailed fluctuations. Once selection is applied, however, this instability is converted into rapid collapse. In particular, Replace+Select reaches a final dispersion ratio of $0.0000$ for every tested distribution family and sample size in [Table 2](https://arxiv.org/html/2606.13732#A3.T2). This all-zero outcome is not a missing-data artifact; it reflects a degenerate regime in which selection repeatedly concentrates the generated distribution around the local reference and eliminates measurable spread.
The complete dispersion statistics in [Table 2](https://arxiv.org/html/2606.13732#A3.T2) also show that the phenomenon is stable across sample sizes. For Accumulate, the final dispersion ratio remains close to one in most cases, ranging from $0.9455$ to $1.0439$ across the reported non-Gaussian families and sample sizes. In contrast, Accumulate+Select consistently reduces dispersion by an order of magnitude or more, with final ratios below $0.15$ in all reported cases and below $0.02$ for the Laplace family. This pattern is important because it separates the effect of biased selection from finite-sample randomness: increasing the sample size from $n=100$ to $n=500$ does not remove the collapse trend. Instead, the same qualitative behavior persists across structured, multimodal, imbalanced, and heavy-tailed regimes.
Beyond aggregate dispersion, the imbalanced-mixture and Laplace settings provide targeted diagnostics for the type of diversity being lost. The imbalanced mixture tests whether selection erodes minority components rather than merely reducing total variance. As shown in [Table 3](https://arxiv.org/html/2606.13732#A3.T3), selection lowers both the smallest component weight and the component-weight entropy under accumulation, indicating that low-probability modes become less represented even when the overall recursive process remains stable. Under replacement, the effect is stronger: the minority component disappears completely for $n=100$ and $n=300$ under Replace+Select, and is reduced to a near-zero weight at $n=500$. This supports the interpretation that local-reference selection preferentially suppresses modes that are weakly represented or distant from the verifier's reference.
The Laplace diagnostics in [Table 4](https://arxiv.org/html/2606.13732#A3.T4) show a complementary form of collapse. Since the Laplace family has heavier tails, mean absolute scale measures whether tail magnitude is retained. Accumulation preserves this scale near its initial level, while Accumulate+Select reduces it from approximately $2.0$ to about $0.24$ across all sample sizes. Replace+Select further contracts the scale to nearly zero. Thus, biased filtering does not merely collapse multimodal structure; it can also remove heavy-tailed variation. Together, these diagnostics indicate that the dispersion collapse observed in [Figure 8](https://arxiv.org/html/2606.13732#A3.F8) corresponds to structured loss of distributional coverage, including minority-mode erosion and tail-scale contraction.
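The diagnostics discussed above are simple summary statistics, sketched below with the standard library only. The text does not give their exact estimators, so the definitions here are assumptions: component weights are taken as empirical label frequencies, entropy is the Shannon entropy in nats, and mean absolute scale is the mean absolute deviation from the sample mean.

```python
import math
from collections import Counter


def dispersion_ratio(disp_t, disp_0):
    """DispRatio(P_t) = Disp(P_t) / Disp(P_0) for any chosen dispersion measure."""
    return disp_t / disp_0


def component_weights(labels, num_components):
    """Empirical mixture weights from per-sample component labels (ASSUMED estimator)."""
    counts = Counter(labels)
    return [counts.get(c, 0) / len(labels) for c in range(num_components)]


def weight_entropy(weights):
    """Shannon entropy (nats) of the component weights; zero weights contribute nothing."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def mean_absolute_scale(samples):
    """Mean absolute deviation from the sample mean (ASSUMED tail-scale estimator)."""
    center = sum(samples) / len(samples)
    return sum(abs(x - center) for x in samples) / len(samples)


# Hypothetical labels before and after selection erodes the minority component.
before = [0] * 80 + [1] * 20
after = [0] * 99 + [1] * 1
w_before, w_after = component_weights(before, 2), component_weights(after, 2)
print(min(w_before), weight_entropy(w_before))
print(min(w_after), weight_entropy(w_after))
print(dispersion_ratio(0.24, 2.0))
```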
### C.5 Algorithms and Empirical Results
In this section, we present the pseudocode for our algorithm, followed by additional experiments, including (i) an analysis of the computational time of [Algorithms 1](https://arxiv.org/html/2606.13732#alg1) and [2](https://arxiv.org/html/2606.13732#alg2) and (ii) the impact of varying Dirichlet settings.
Algorithm 1 Scheme I: Collaborative Geodesic Interpolation & Greedy Selection

1: Input: Synthetic data $\mathcal{P}$, real data shards $\{\mathcal{Q}_k\}_{k=1}^{K}$, rounds $R$, budget $n$.

2: Output: Selected subset $\mathcal{I}$.

3: // Phase 1: Geodesic Proxy Estimation (see [Figure 2](https://arxiv.org/html/2606.13732#S4.F2))

4: Initialize proxies $\xi_k^{(0)}$ for all $k$

5: for $r=0$ to $R-1$ do

6: for $k=1$ to $K$ in parallel do

7: Interpolation: $\xi_{\mathcal{Q}_k}^{(r)} \leftarrow \arg\min_{\xi} \mathbb{W}_p(\mathcal{Q}_k, \xi) + \mathbb{W}_p(\xi, \xi_k^{(r)})$ (see [Property 2](https://arxiv.org/html/2606.13732#Thmproperty2))

8: Interpolation: $\xi_{\mathcal{P}}^{(r)} \leftarrow \arg\min_{\xi} \mathbb{W}_p(\mathcal{P}, \xi) + \mathbb{W}_p(\xi, \xi_k^{(r)})$

9: Update proxy: $\xi_k^{(r+1)} \leftarrow \arg\min_{\xi} \mathbb{W}_p(\xi_{\mathcal{P}}^{(r)}, \xi) + \mathbb{W}_p(\xi, \xi_{\mathcal{Q}_k}^{(r)})$

10: end for

11: end for

12: Set final proxies $\xi_k^{*} \leftarrow \xi_k^{(R)}$ (converges via [Theorem 4](https://arxiv.org/html/2606.13732#Thmtheorem4))

13: // Phase 2: Greedy Selection

14: for $k=1$ to $K$ in parallel do

15: Compute scores: $\mathcal{S}_k(x_i) \leftarrow f^{*}(x_i) - \frac{1}{N-1}\sum_{j\neq i}^{N} f^{*}(x_j)$ ([Eq. 18](https://arxiv.org/html/2606.13732#S4.E18))

16: Min-max normalization: $\tilde{\mathcal{S}}_k(x_i) = \frac{\mathcal{S}_k(x_i) - \min_j \mathcal{S}_k(x_j)}{\max_j \mathcal{S}_k(x_j) - \min_j \mathcal{S}_k(x_j)}$

17: end for

18: Initialize: $\mathcal{I} \leftarrow \emptyset$

19: Greedy selection: $\underset{\mathcal{I} \subseteq \{1,\dots,N\}}{\text{maximize}} \quad \sum_{k=1}^{K} g\big(\sum_{i\in\mathcal{I}} (1-\tilde{\mathcal{S}}_k(x_i))\big) \quad \text{s.t.}\ \ |\mathcal{I}| \leq n.$

20: Return $\mathcal{I}$
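Phase 2 of Algorithm 1 can be sketched as a standard greedy maximization. The code assumes that the per-silo scores $\mathcal{S}_k(x_i)$ are already available as a $K \times N$ array and that $g$ is concave and non-decreasing; the square root used here is a placeholder choice, since $g$ is not specified in this section. Function names are hypothetical.

```python
import numpy as np


def min_max_normalize(scores):
    """Row-wise min-max normalization of a (K, N) score matrix (step 16)."""
    lo = scores.min(axis=1, keepdims=True)
    hi = scores.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant rows
    return (scores - lo) / span


def greedy_select(scores, budget, g=np.sqrt):
    """Greedily maximize sum_k g(sum_{i in I} (1 - S~_k(x_i))) subject to |I| <= budget."""
    gain = 1.0 - min_max_normalize(np.asarray(scores, dtype=float))  # shape (K, N)
    num_silos, num_samples = gain.shape
    totals = np.zeros(num_silos)            # running inner sums, one per silo
    available = np.ones(num_samples, dtype=bool)
    selected = []
    for _ in range(min(budget, num_samples)):
        # Marginal increase of the objective for each remaining candidate.
        marginal = (g(totals[:, None] + gain) - g(totals[:, None])).sum(axis=0)
        marginal[~available] = -np.inf
        best = int(np.argmax(marginal))
        selected.append(best)
        available[best] = False
        totals += gain[:, best]
    return selected


# Two silos, four candidates: candidate 2 scores low (good) for both silos.
scores = np.array([[0.0, 1.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0, 1.0]])
print(greedy_select(scores, budget=3))
```

With a concave $g$, a candidate that helps a silo whose running sum is still small earns a larger marginal gain, which is what pushes the selection to serve all silos rather than one.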
Algorithm 2 Scheme II: Collaborative Barycenter Estimation

1: Input: Synthetic data $\mathcal{P}$, real data shards $\{\mathcal{Q}_k\}_{k=1}^{K}$, rounds $R$, selection ratio $\alpha$.

2: Output: Selected indices $\mathcal{I}$.

3: // Phase 1: Offline Barycenter Estimation

4: Server: Initializes global barycenter $\xi^{(0)}$ (see [Figure 3](https://arxiv.org/html/2606.13732#S4.F3))

5: for $r=0$ to $R-1$ do

6: Server: Broadcasts $\xi^{(r)}$ to all parties

7: for $k=1$ to $K$ in parallel do

8: Client $k$: Computes local interpolation

9: $\xi_k^{(r)} \leftarrow \arg\min_{\xi} \mathbb{W}_p(\mathcal{Q}_k, \xi) + \mathbb{W}_p(\xi, \xi^{(r)})$

10: Client $k$: Sends $\xi_k^{(r)}$ to Server

11: end for

12: Server: Updates barycenter (see [Property 1](https://arxiv.org/html/2606.13732#Thmproperty1))

13: $\xi^{(r+1)} \leftarrow \arg\min_{\xi} \sum_{k=1}^{K} \mathbb{W}_p(\xi, \xi_k^{(r)})$

14: end for

15: $\xi^{*} \leftarrow \xi^{(R)}$ (converges via [Theorem 5](https://arxiv.org/html/2606.13732#Thmtheorem5))

16: // Phase 2: Calibrated Gradient Filtering

17: for each $x_i \in \mathcal{P}$ do

18: Compute Score: $\mathcal{S}(x_i) \leftarrow f^{*}(x_i) - \frac{1}{N-1}\sum_{j\neq i}^{N} f^{*}(x_j)$ ([Eq. 18](https://arxiv.org/html/2606.13732#S4.E18))

19: end for

20: $\mathcal{I} \leftarrow \operatorname{arg\,topk}(\{-\mathcal{S}(x_i)\}, k=\alpha|\mathcal{P}|)$

21: Return $\mathcal{I}$
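Phase 2 of Algorithm 2 reduces to a leave-one-out centering followed by a top-$k$ cut, as the sketch below shows. It assumes the values $f^{*}(x_i)$ have already been computed against the barycenter and are passed in as a plain array; the function names are illustrative.

```python
import numpy as np


def calibrated_scores(potentials):
    """S(x_i) = f*(x_i) - mean of f*(x_j) over all j != i (step 18)."""
    f = np.asarray(potentials, dtype=float)
    leave_one_out_mean = (f.sum() - f) / (f.size - 1)
    return f - leave_one_out_mean


def select_top_fraction(potentials, alpha):
    """Indices of the alpha * |P| samples with the largest -S, i.e. the smallest S (step 20)."""
    scores = calibrated_scores(potentials)
    k = int(alpha * scores.size)
    return np.argsort(scores, kind="stable")[:k]


potentials = [3.0, 1.0, 2.0, 0.0]   # hypothetical f*(x_i) values
print(calibrated_scores(potentials))
print(select_top_fraction(potentials, alpha=0.5))
```

Because the leave-one-out mean is an affine function of $f^{*}(x_i)$, the centering shifts and rescales the scores without changing their ranking within a single pool.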
##### Data Heterogeneity\.
As illustrated in [Figure 9](https://arxiv.org/html/2606.13732#A3.F9), Scheme I exhibits superior robustness in highly non-IID regimes (Dirichlet concentration $\alpha<0.5$) by effectively leveraging heterogeneity as a diversity source. In contrast, Scheme II shows significant performance gains as the data distribution becomes more homogeneous ($\alpha\to 1.0$), and does so at lower computational cost.



Figure 9: Evolution of the Training Dataset under the Replace Paradigm. (Left) Snapshot at Iteration 1. (Middle) Snapshot at Iteration 10. (Right) Impact of Data Heterogeneity. We report FID scores under varying Dirichlet concentration parameters ($\alpha$).
- Heterogeneity Transforms into Diversity. Scheme I (Circles) consistently outperforms Scheme II (Squares) in terms of generation quality (lower FID), particularly in highly non-IID settings (low Dirichlet concentration $\alpha<0.5$). As illustrated in the plot, Scheme I maintains a stable low FID ($\approx 56$) even under extreme heterogeneity. This validates the theoretical premise in [Subsection 4.3](https://arxiv.org/html/2606.13732#S4.SS3): by computing interpolations along geodesics to multiple distinct targets rather than a single aggregated mean, Scheme I preserves the unique distributional modes of disjoint data shards.
- Efficiency and Proxy Validity in Homogeneous Regimes. As the data distribution becomes more homogeneous (Dirichlet $\alpha\to 1.0$), the performance gap between the two methods narrows significantly, with Scheme II showing a steep improvement in FID (dropping from $\approx 72$ to $\approx 65$). This indicates that when the divergence between local distributions is low, the Wasserstein barycenter (Scheme II) becomes a statistically valid proxy for the ground truth.
##### Topic\-local verification in recursive LLM training\.
Figure 10: Held-out topic generalization under recursive training with a technology-local verifier. Under ROUGE-based local selection, held-out performance drops early and stabilizes below that of random selection.

We further examine whether local-reference selection exhibits a similar failure mode when the verifier bias is semantic rather than geometric. Following the verification-based recursive training setup of Feng et al. ([2025](https://arxiv.org/html/2606.13732#bib.bib4)), we fine-tune Llama-2-7B (Touvron et al., [2023](https://arxiv.org/html/2606.13732#bib.bib72)) on the English subset of XLSum (Hasan et al., [2021](https://arxiv.org/html/2606.13732#bib.bib70)). To instantiate a siloed semantic-reference setting, we partition XLSum by topic so that each entity observes only one topical subset as its local data. In the experiment shown in [Figure 10](https://arxiv.org/html/2606.13732#A3.F10), the verifier is constructed from the technology subset, and generated samples are filtered by their ROUGE score against this technology-local reference (Lin, [2004](https://arxiv.org/html/2606.13732#bib.bib65)). Recursive training is then evaluated on held-out non-technology topics.
This experiment is not intended as a comprehensive LLM benchmark. Its purpose is narrower: to test whether a verifier that enforces local semantic alignment can simultaneously degrade coverage outside its reference domain. In this sense, the setup reflects a low-resource verification condition: the verifier has access only to a restricted topical slice, while held-out topics behave as underrepresented regions of the target distribution. This framing is consistent with recent concerns that model collapse can be especially consequential when tail information is scarce or weakly represented (Jarvis et al., [2026](https://arxiv.org/html/2606.13732#bib.bib92)). As shown in [Figure 10](https://arxiv.org/html/2606.13732#A3.F10), ROUGE-based local selection yields weaker held-out topic generalization than random selection. The degradation occurs early in recursive training and persists across later iterations, suggesting that filtering against a narrow topical reference can suppress samples useful for non-local semantic regions. This observation supports the broader mechanism analyzed in the main text: when the verifier's reference is semantically narrow, sample selection can become a source of coverage loss rather than a safeguard against collapse.
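The filtering mechanism discussed above can be illustrated with a toy sketch. It uses a simplified unigram-overlap F1 as a stand-in for ROUGE-1 (whitespace tokenization, no stemming), and the names `rouge1_f1`, `local_reference_filter`, and `keep_ratio` are hypothetical. It is not the verifier used in the experiment, only an illustration of how scoring against a topic-local reference favors samples that resemble that topic.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings (simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def local_reference_filter(samples, local_references, keep_ratio):
    """Keep the top fraction of samples, ranked by best score against the local references."""
    scored = [(max(rouge1_f1(s, r) for r in local_references), s) for s in samples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    k = int(keep_ratio * len(samples))  # assumption: round the budget down
    return [s for _, s in scored[:k]]
```

With a technology-only reference set, samples about other topics share few unigrams with any reference and are ranked last, which mirrors the coverage loss described in the text.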
### C.6 Differentially Private Optimal Transport
To provide formal privacy guarantees beyond the implicit protection of barycentric compression, we integrate the Differentially Private Optimal Transport (DPOT) framework (Lê Tien et al., [2019](https://arxiv.org/html/2606.13732#bib.bib23)). This method leverages the Johnson–Lindenstrauss transform (Kenthapadi et al., [2013](https://arxiv.org/html/2606.13732#bib.bib88)) to sanitize the pairwise distance matrix required for optimal transport. Let $\mathbf{X}_{\mathcal{P}} \in \mathbb{R}^{n \times d}$ and $\mathbf{X}_{\mathcal{Q}} \in \mathbb{R}^{m \times d}$ be the data matrices associated with probability measures $\mathcal{P}$ and $\mathcal{Q}$. The party holding $\mathcal{P}$ generates a random Gaussian projection matrix $\mathbf{M} \in \mathbb{R}^{d \times l}$ (entries i.i.d. $\sim \mathcal{N}(0, 1/l)$) and adds a noise matrix $\mathbf{\Delta}$ (entries $\sim \mathcal{N}(0, \sigma)$) to release a sanitized view $\tilde{\mathbf{X}}_{\mathcal{P}} = \mathbf{X}_{\mathcal{P}}\mathbf{M} + \mathbf{\Delta}$. The receiving party holding $\mathcal{Q}$ then projects their data as $\tilde{\mathbf{X}}_{\mathcal{Q}} = \mathbf{X}_{\mathcal{Q}}\mathbf{M}$ and approximates the cost matrix as $\tilde{\mathbf{C}}_{ij} = \|(\tilde{\mathbf{X}}_{\mathcal{P}})_{i} - (\tilde{\mathbf{X}}_{\mathcal{Q}})_{j}\|^{2} - l\sigma^{2}$, where the term $l\sigma^{2}$ corrects the bias induced by the noise. This mechanism satisfies $(\epsilon, \delta)$-differential privacy provided that $\sigma \geq w\sqrt{2(\ln(1/2\delta) + \epsilon)}/\epsilon$, where $w$ is the sensitivity of the projection.
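The release-and-correct steps above can be sketched in NumPy. This is an illustrative sketch under stated assumptions, not the DPOT reference implementation: the function names are hypothetical, $\sigma$ is treated as the noise standard deviation (which is what makes $l\sigma^{2}$ the expected squared norm of a noise row), $\ln(1/2\delta)$ is read as $\ln(1/(2\delta))$, and the sensitivity value `1.0` is a placeholder rather than a computed quantity.

```python
import math
import numpy as np

def gaussian_noise_scale(epsilon, delta, sensitivity):
    """Smallest sigma allowed by the stated bound, reading ln(1/2δ) as ln(1/(2δ))."""
    return sensitivity * math.sqrt(2.0 * (math.log(1.0 / (2.0 * delta)) + epsilon)) / epsilon

def sanitize(x_p, proj, sigma, rng):
    """Party P: release X_P M + Delta with i.i.d. Gaussian noise of std sigma (assumed)."""
    projected = x_p @ proj
    return projected + rng.normal(0.0, sigma, size=projected.shape)

def private_cost_matrix(x_p_tilde, x_q, proj, sigma):
    """Party Q: squared distances in the projected space minus the l * sigma^2 bias term."""
    x_q_tilde = x_q @ proj
    diff = x_p_tilde[:, None, :] - x_q_tilde[None, :, :]
    return (diff ** 2).sum(axis=-1) - proj.shape[1] * sigma ** 2

rng = np.random.default_rng(0)
n, m, d, l = 8, 6, 32, 16
x_p, x_q = rng.normal(size=(n, d)), rng.normal(size=(m, d))
# Projection entries have variance 1/l, i.e. standard deviation sqrt(1/l).
proj = rng.normal(0.0, math.sqrt(1.0 / l), size=(d, l))
# The sensitivity below is a placeholder value, not derived from the data.
sigma = gaussian_noise_scale(epsilon=1.0, delta=1e-5, sensitivity=1.0)
cost = private_cost_matrix(sanitize(x_p, proj, sigma, rng), x_q, proj, sigma)
```

Note that individual entries of the corrected cost matrix can be negative, since the correction removes the noise bias only in expectation. A downstream OT solver may need them clipped or shifted.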
To validate the theoretical underpinnings of our data selection method, we analyze the sensitivity of the optimal transport (OT) distance to perturbations in the probability mass of individual data points. We conduct this experiment using subsets of the CIFAR-10 dataset, with $N = 5{,}000$ samples randomly selected for both the source and target distributions.
[Figure 11](https://arxiv.org/html/2606.13732#A3.F11) visualizes the relationship between the predicted change in OT distance (derived from our gradient-based scoring) and the actual numerical change observed when re-solving the exact OT problem. The perturbation involves varying the probability mass of a single target point from $-100\%$ to $+100\%$. For the privacy-preserving scenario, we employ a privacy budget of $\epsilon = 1.0$, an interpolation factor of $t = 0.5$, and report the average predictions over 30 independent trials to account for the stochasticity of the DP mechanism.
- (a) Validation of the Direct Gradient: Figure [11](https://arxiv.org/html/2606.13732#A3.F11)(a) compares the gradient-based prediction using standard dual potentials (Green points) against the ground-truth change in Wasserstein distance (Black dashed line). The close agreement between the two empirically verifies that the scoring mechanism accurately reflects the contribution of each data point to the overall distributional distance.
- (b) Effectiveness of the Proxy Gradient: In Figure [11](https://arxiv.org/html/2606.13732#A3.F11)(b), we evaluate the accuracy of gradients estimated using a Proxy Gradient constructed via geodesic interpolation at $t = 0.5$ (Light red points). The proxy-based gradients exhibit a near-perfect correlation with the direct gradient (Green line). This result justifies the use of proxies for data selection, demonstrating that full transport to the final target is not strictly necessary to identify high-value samples.
- (c) Robustness to Differential Privacy: Figure [11](https://arxiv.org/html/2606.13732#A3.F11)(c) demonstrates the method's robustness under a strict privacy budget ($\epsilon = 1.0$). Despite the injection of noise into the proxy target construction (Red points), the estimated gradients maintain a strong positive correlation with the clean direct gradient (Green line). While the individual points exhibit variance due to the DP noise, the overall trend remains linear and directionally correct, ensuring that the selection mechanism remains effective even when the target distribution is private. We empirically show that there is a negligible difference in FID results between data selected via the noisy proxy and data selected via the direct method.
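For reference, the first-order relation that such a dual-potential prediction relies on can be written out for discrete OT. The notation here ($\mu$, $\nu$, $f$, $g$, $\eta$, $C$) is introduced for illustration and may differ from the main text; the approximation assumes the optimal dual potentials $g^{\star}$ are unique up to an additive constant and that the perturbation $\eta$ preserves total mass.

```latex
\mathrm{OT}(\mu, \nu) = \max_{f, g} \; \sum_{i} f_i \mu_i + \sum_{j} g_j \nu_j
\quad \text{s.t.} \quad f_i + g_j \le C_{ij} \;\; \forall i, j,
\qquad
\mathrm{OT}(\mu, \nu + \eta) - \mathrm{OT}(\mu, \nu) \approx \sum_{j} g_j^{\star} \eta_j
\quad \text{for } \sum_{j} \eta_j = 0 .
```

Under this reading, the "predicted" change plotted in Figure 11 corresponds to the right-hand sum evaluated with the dual potentials of the unperturbed problem, while the "actual" change is the left-hand difference obtained by re-solving the problem.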
Figure 11: Predicting the change in OT distance when increasing or reducing the probability mass of a point on the CIFAR-10 dataset. (a) Direct Gradient: The predicted change in Wasserstein distance (Green points) computed via standard dual potentials accurately matches the actual numerical change (Black dashed line), confirming the theoretical validity of using dual potentials as influence scores. (b) Proxy Gradient: The gradient estimated using the Proxy target (geodesic interpolation at $t = 0.5$) (Light red points) aligns perfectly with the Direct Gradient, justifying the use of the proxy distribution for efficient selection. (c) Differential Privacy: Under a privacy budget of $\epsilon = 1.0$, the Noisy Proxy target (Red points) remains highly correlated with the clean gradients despite the added noise, demonstrating the robustness of our method for private data selection.