MIND: Monge Inception Distance for Generative Models Evaluation


# Monge Inception Distance for Generative Models Evaluation
Source: [https://arxiv.org/html/2605.06797](https://arxiv.org/html/2605.06797)
Quentin Berthet¹, Yu-Han Wu¹,², Clément Crepy¹, Romuald Elie¹, Klaus Greff¹, Michaël E. Sander¹
¹Google DeepMind, ²LPSM, Sorbonne Université

###### Abstract

We propose the Monge Inception Distance (MIND), a metric for evaluating generative models that addresses key limitations of the widely adopted Fréchet Inception Distance (FID). The MIND metric leverages the sliced Wasserstein distance to compare distributions by averaging one-dimensional optimal transport distances, efficiently computed via sorting. This approach circumvents the estimation of high-dimensional means and covariance matrices, which underlie FID's poor sample complexity and vulnerability to adversarial attacks. We empirically demonstrate three primary advantages: (i) it is more sample-efficient by one order of magnitude, (ii) it is faster to compute by two orders of magnitude, (iii) it is more robust to adversarial attacks such as moment-matching. We show that MIND with 5k samples can replace the evaluation performance of FID with 50k samples, providing high correlation with this standard benchmark and superior discriminative performance. We further demonstrate that even smaller sample sizes (e.g., 1k or 2k) remain highly informative for rapid model iteration.

![Refer to caption](https://arxiv.org/html/2605.06797v1/x1.png)

Figure 1: (Left) MIND metric during a diffusion model training run on ImageNet-64 (log scale), illustrating how $\operatorname{MIND}_{5k}$ can be used to replace $\text{FID}_{50k}$, with a larger range; see Section 4.3. (Right) Correlation with the number of training steps, better for $\operatorname{MIND}_{1k}$ and $\operatorname{MIND}_{5k}$ than FID with 50k samples.

## 1 Introduction

Generative models, especially diffusion models (Ho et al., 2020), have set new standards in high-quality data synthesis. This progress has spurred innovation across numerous fields, from creative arts to scientific simulation. However, as models grow in complexity, the metrics used to evaluate them have struggled to keep pace (Stein et al., 2023). The de facto standard, the Fréchet Inception Distance (FID) (Heusel et al., 2017), is based on a Gaussian approximation of pre-trained network embeddings, such as those of the Inception-v3 model (Salimans et al., 2016).

This metric relies on estimating a high-dimensional mean and covariance matrix from Inception embeddings. However, this sample-heavy approach typically requires 50k samples (Chong and Forsyth, 2020), creating a significant development bottleneck. Furthermore, because FID only considers the first two moments of the distributions, it is not a true distance metric and is vulnerable to adversarial "hacking" without corresponding visual improvements (Sajjadi et al., 2018).

In this work, we introduce the Monge Inception Distance (MIND), addressing these limitations. Based on optimal transport theory and named in honor of Gaspard Monge, who introduced the optimal transport problem (Monge, 1781), MIND leverages the sliced Wasserstein distance (Rabin et al., 2011). Instead of relying on a Gaussian simplification as in FID, MIND reduces the complexity of high-dimensional optimal transport comparison by averaging many one-dimensional projections, where the transport problem is solved exactly via a simple, parallelizable sorting operation. This captures finer distributional details without the statistical instability inherent in high-dimensional matrix estimation; see (Villani, 2008; Peyré and Cuturi, 2019) for a modern perspective on optimal transport theory and applications.

We highlight that this approach yields stable, high-quality evaluation using only 5k samples, an order of magnitude fewer than FID, while offering 100× faster computation and increased robustness to adversarial moment-matching. We statistically validate the performance of MIND across various hypothesis testing problems at different sample sizes. Finally, while we present our results using Inception-v3 embeddings to facilitate a direct comparison with the current FID benchmark, the MIND metric is fundamentally embedding-agnostic. It is modality-independent and can be seamlessly applied to any representation space, including CLIP (Radford et al., 2021), DINO (Oquab et al., 2024; Siméoni et al., 2025), or specialized embeddings for audio and video synthesis.

Main contributions. In this work, we introduce MIND, a metric for improved evaluation of generative models. We demonstrate the following advantages of this metric:

- Sample Efficiency: We show that $\operatorname{MIND}_{5k}$ provides a stable evaluation that correlates highly with $\text{FID}_{50k}$, enabling reliable benchmarking with 10× fewer samples.
- Computational Speed and Memory Efficiency: Due to its reliance on 1D sorting rather than high-dimensional matrix operations, MIND is over 100× faster to compute and requires 10× less memory, facilitating real-time evaluation during training.
- Metric Robustness: Since MIND is derived from a proper distance, we show that it is significantly more resistant to "metric hacking" via moment-matching attacks that can artificially lower FID.
- Discriminative Power: Our experiments show that MIND more reliably distinguishes between model checkpoints and identifies subtle image perturbations at low sample sizes.

## 2 Generative model evaluations

We consider the problem of evaluating a generative model $g_{\theta}$ that generates outputs by mapping noise $Z\sim\mathcal{N}(0,I)$ to data outputs (e.g. images) $a=g_{\theta}(Z)$. In standard practice, these outputs are passed through a pre-trained feature extraction model $\psi_{w}$ to obtain embeddings. For an Inception-v3 model (Szegedy et al., 2016), these embeddings typically reside in dimension $d=2048$.

The performance of the model is measured by the statistical distance between the distribution of generated embeddings, $X=\psi_{w}(g_{\theta}(Z))\sim p_{\theta}$, and the distribution of real dataset embeddings, $Y=\psi_{w}(D)\sim p_{\text{data}}$. In practice, this distance is estimated using finite samples of size $n$, denoted as the empirical distributions $\hat{p}_{n,\theta}$ and $\hat{p}_{n,\text{data}}$ (see Appendix A). Any measure of statistical distance between these distributions can be used, and we consider in this work several inception distances, defined as follows for any distance or divergence $\Delta$ between distributions (see, e.g., Cover, 1999).

###### Definition 2.1 (General Inception distance).

Let $X=\psi_{w}(g_{\theta}(Z))\sim p_{\theta}$ and $Y=\psi_{w}(D)\sim p_{\text{data}}$. For a distribution distance function $\Delta$, the performance of the model $g_{\theta}$ is given by $\Delta\text{ID}(p_{\theta},p_{\text{data}})$. With a sample of size $n$, its empirical estimate is the plug-in value $\Delta\text{ID}(\hat{p}_{n,\theta},\hat{p}_{n,\text{data}})$.

![Refer to caption](https://arxiv.org/html/2605.06797v1/x2.png)

Figure 2: General pipeline for evaluating the distance between a generative model's sampling distribution and a dataset.

### 2.1 Existing method: Fréchet Inception Distance (FID)

FID is the most widely adopted instance of an inception distance. It measures the distance between two distributions based on their first two moments, the mean $\mu$ and covariance $\Sigma$, using the squared 2-Wasserstein distance $W_{2}^{2}$ (see Appendix A.2).

###### Definition 2.2 (Fréchet Inception Distance, FID).

Let $X=\psi_{w}(g_{\theta}(Z))\sim p_{\theta}$ and $Y=\psi_{w}(D)\sim p_{\text{data}}$, and let $\mu_{X},\Sigma_{X}$ and $\mu_{Y},\Sigma_{Y}$ be the means and covariances of $p_{\theta}$ and $p_{\text{data}}$, respectively. The FID is defined as the squared 2-Wasserstein distance between two fitted Gaussians:

$$\text{FID}(p_{\theta},p_{\text{data}})=\|\mu_{X}-\mu_{Y}\|^{2}+\operatorname{tr}\big(\Sigma_{X}+\Sigma_{Y}-2(\Sigma_{Y}\Sigma_{X})^{1/2}\big)\,.$$

In practice, this is computed using empirical sample means and covariances $\hat{\mu}_{n}$ and $\hat{\Sigma}_{n}$.

This approach was originally motivated as a way to bypass the high computational and statistical complexity of a direct, sample-based Wasserstein distance by using a Gaussian approximation.
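To make the plug-in computation in Definition 2.2 concrete, the following is a minimal sketch of the empirical FID from two embedding matrices, using NumPy and SciPy's matrix square root. This is an illustrative reimplementation of the standard formula, not the authors' evaluation code, and the function name is our own.

```python
import numpy as np
from scipy import linalg

def fid(x, y):
    """Empirical FID between embedding samples x (n, d) and y (m, d).

    Fits a Gaussian to each sample and returns the squared 2-Wasserstein
    distance between the two fitted Gaussians.
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Matrix square root of the product of the two covariance estimates.
    covmean, _ = linalg.sqrtm(cov_y @ cov_x, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts due to numerical error
    return float(np.sum((mu_x - mu_y) ** 2)
                 + np.trace(cov_x + cov_y - 2.0 * covmean))
```

Note that $\hat{\Sigma}_{n}$ computed this way is rank-deficient whenever $n\leq d$, which underlies the sample-size issues discussed next.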

##### Drawbacks

Despite its widespread use and its role as the de facto standard metric, there are several drawbacks in using this distance, as noted in several works (see, e.g., Karras et al., 2017; Stein et al., 2023; Jayasumana et al., 2024; Bischoff et al., 2024; Yang et al., 2026):

- Computing this distance is based on the estimate $\hat{\Sigma}_{n}$ of a $d$-by-$d$ covariance. It is rank-deficient for $n\leq d$, creating numerical and statistical issues when estimating the second term, unless the sample size is at least of order $d$, which is 2048 for Inception networks. For images, this means that the sample sizes usually used are 10k or 50k (Bińkowski et al., 2018; Chong and Forsyth, 2020), with a high impact on evaluation time and cost.
- FID is also not a proper distance (Jayasumana et al., 2024): two distributions can have the same mean and covariance and yet be very different (see, for example, Billingsley, 2017, Section 30). We show that, as a consequence, it is not robust, and this fact can be leveraged to artificially reduce the FID without visually altering the images (see Section 4.5).

We also evaluate other inception-based distances proposed as metrics for comparison, such as the Maximum Mean Discrepancy (MMD) or the Sinkhorn divergence. We provide full definitions and a discussion in the appendix (Section A.3), and include them in comparisons.

## 3 Proposed method: Monge Inception Distance (MIND)

We propose the following metric to overcome these challenges, based on the sliced Wasserstein distance, which has several known advantages in terms of both statistical and computational complexity. It is an average of the Wasserstein distances between the distributions of projections, over all unit directions (see, e.g., Rabin et al., 2011; Nadjahi, 2021, and Appendix A.2 in this work for details).

The sliced Wasserstein distance is a proper distance: it is equal to 0 if and only if both distributions are equal (Bonnotte, 2013, Proposition 5.1.2). We use this approach for our MIND metric, taking the average of Wasserstein distances projected along finitely many unit directions as an estimate.

###### Definition 3.1 (Monge Inception Distance).

Let $X=\psi_{w}(g_{\theta}(Z))\sim p_{\theta}$ and $Y=\psi_{w}(D)\sim p_{\text{data}}$, and let $\mathcal{U}(S)$ be the uniform distribution on the unit sphere. MIND is given by averaging $W^{2}_{2}$ distances for projections of the distributions along unit directions, with a multiplicative scaling $\alpha=3d$:

$$\operatorname{MIND}(p_{\theta},p_{\text{data}})=\alpha\,\mathbb{E}_{u\sim\mathcal{U}(S)}\big[W^{2}_{2}(u^{\intercal}p_{\theta},u^{\intercal}p_{\text{data}})\big]\,,$$

where $u^{\intercal}p_{\theta}$ (resp. $u^{\intercal}p_{\text{data}}$) is the distribution of $u^{\intercal}X$ when $X\sim p_{\theta}$ (resp. of $u^{\intercal}Y$ when $Y\sim p_{\text{data}}$) and $d$ is the data dimension.

For finite samples $(X_{j})_{j\in[n]}$, $(Y_{j})_{j\in[n]}$, random unit directions $(u_{i})_{i\in[M]}$ and $\alpha=3d$, it is given by

$$\operatorname{MIND}(\hat{p}_{n,\theta},\hat{p}_{n,\text{data}})=\frac{\alpha}{M}\sum_{i=1}^{M}W^{2}_{2}(u_{i}^{\intercal}\hat{p}_{n,\theta},u_{i}^{\intercal}\hat{p}_{n,\text{data}})=\frac{\alpha}{nM}\sum_{i=1}^{M}\sum_{j=1}^{n}\big|\text{sort}(u_{i}^{\intercal}X)_{j}-\text{sort}(u_{i}^{\intercal}Y)_{j}\big|^{2}\,.$$

##### Remarks

![Refer to caption](https://arxiv.org/html/2605.06797v1/x3.png)

Figure 3: Computation of MIND based on the idea of the sliced Wasserstein distance, illustrated in 2D with a single projection. (Left) Two samples of synthetic embeddings (orange and blue), along with the unit sphere and a random unit direction $u$. (Bottom Right) The two histograms of the distributions of the projections along $u$. (Top Right) The associated cumulative distribution functions (cdf); the hatched area is related to 1D Wasserstein distances along $u$: it is the $W_{1}$ distance, and the pairwise sorted matching is used for all convex costs.

- MIND relies on two finite-sample estimates: the 1D Wasserstein distance over samples $X_{j}$ and $Y_{j}$ of size $n$, and the expectation $\mathbb{E}_{u\sim\mathcal{U}(S)}$ over $M$ random unit directions $u_{i}$.
- Although we adopt the name of Inception Distance for consistency with established literature, the formulation of MIND does not depend on the Inception architecture. The mapping function $\psi_{w}$ can represent any feature extractor; consequently, MIND serves as a general-purpose tool for evaluating distributional similarity across diverse data modalities and embedding models.
- This 1D formulation allows a more stable evaluation using an order of magnitude fewer samples, with $n$ of order 5k rather than 50k (see Sections 4.3 and 4.4). Furthermore, because the sliced Wasserstein distance is a proper distance, having matching means and covariance matrices is not sufficient for MIND to be zero, making it inherently robust to moment-matching hacking (see Section 4.5).
- Leveraging the exact solution of the 1D transport problem (see, e.g., Peyré and Cuturi, 2019, and Appendix A.2 in this work), the distance simplifies to pairwise differences between sorted elements:
  $$W_{2}^{2}(\hat{p}_{n},\hat{q}_{n})=\frac{1}{n}\sum_{j=1}^{n}|\text{sort}(x)_{j}-\text{sort}(y)_{j}|^{2}\,,$$
  where $\text{sort}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is the function that maps a vector $x\in\mathbb{R}^{n}$ to its copy sorted in nondecreasing order, i.e. $\text{sort}(x)=(x_{\sigma(1)},\ldots,x_{\sigma(n)})^{\intercal}$ such that $x_{\sigma(1)}\leq\ldots\leq x_{\sigma(n)}$ (ties do not make this function ambiguous). Since the sorting operation runs in $O(n\log n)$ time, MIND avoids the need to estimate or store high-dimensional objects (e.g. $d\times d$ matrices for FID, $n\times n$ matrices for other distances). Similarly to FID, we use the square of the distance rather than taking a square root of the average. We also use a multiplicative scaling factor $\alpha$; see the discussion in Section 4.2.

(a) JAX
![Refer to caption](https://arxiv.org/html/2605.06797v1/x4.png)

(b) PyTorch
![Refer to caption](https://arxiv.org/html/2605.06797v1/x5.png)

Figure 4: JAX (a) and PyTorch (b) implementations of MIND.

## 4 Experiments

### 4.1 Implementation

As noted above, estimating MIND on two samples of size $n$ is both computationally and conceptually easy. It requires only trivially parallelizable projections and sorting operations. As such, it is particularly adapted to modern accelerator-oriented hardware and software. We provide in Figure 4 both a JAX (Bradbury et al., 2018) and a PyTorch (Paszke et al., 2019) implementation, in the form of a short code snippet that can be directly used in an evaluation pipeline. We also provide in Sections 4.7 and 4.8 experimental results showcasing the computation time and memory advantages of this algorithm compared to other methods.
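Since the reference implementations in Figure 4 appear above only as images, here is a minimal PyTorch sketch of the estimator in Definition 3.1. The function name, the default number of projections, and the assumption that both samples have the same size $n$ are our own choices; the authors' snippet may differ in details.

```python
import torch

def mind(x, y, num_projections=1000):
    """Minimal MIND sketch: x, y are (n, d) tensors of embeddings.

    Averages, over random unit directions, the squared 1D Wasserstein
    distance between the projected samples, then applies the 3*d scaling.
    Assumes x and y contain the same number n of embeddings.
    """
    n, d = x.shape
    # M random unit directions on the sphere.
    u = torch.randn(d, num_projections, device=x.device, dtype=x.dtype)
    u = u / u.norm(dim=0, keepdim=True)
    # Project both samples, shape (n, M), and solve the 1D transport by sorting.
    px, _ = torch.sort(x @ u, dim=0)
    py, _ = torch.sort(y @ u, dim=0)
    sw2 = ((px - py) ** 2).mean()  # average over the n samples and M directions
    return 3 * d * sw2             # multiplicative scaling alpha = 3d
```

With $n$ of order 5k and $M$ of order 1000, the dominant cost is the $(n\times d)(d\times M)$ projection followed by $M$ sorts of length $n$, which is what keeps the computation and memory footprint small compared to the $d\times d$ operations required by FID.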

### 4.2 Hyperparameter choices

As noted in Definition 3.1, we scale the MIND metric by a multiplicative factor $\alpha>0$. This is done so that the order of magnitude of this metric matches that of FID. This proximity helps to compare values of MIND to those of FID and is chosen to favor adoption. Based on an analysis on ImageNet-64, we have found that taking $\alpha=3\times d\approx 6000$ is a good fit, especially later in a training run (where $d=2048$ is the dimension of the embedding space); see Figure 5. We also observe that $\operatorname{MIND}_{5k}$ has more range than FID at any sample size, with higher values early in the run (above even those of $\text{FID}_{5k}$), aligned with $\text{FID}_{50k}$ later in the run, and better aligned with the number of steps in a training run (see Figure 1, right). We also compute MIND and the FID score with features of various dimensions. The features are obtained by truncating the Inception-v3 features to the target dimension. Figure 5 (Right) shows that MIND remains in an affine relation with FID while varying the dimension of the feature space in the log-log plot, justifying the choice of the scaling factor $\alpha\propto d$.

![Refer to caption](https://arxiv.org/html/2605.06797v1/x6.png)
![Refer to caption](https://arxiv.org/html/2605.06797v1/x7.png)

Figure 5: $\operatorname{MIND}_{5k}$ and $\text{FID}_{50k}$ (and other sample sizes) for a model at different steps of training on ImageNet-64. Left: All MIND metrics are rescaled by a factor $\alpha\approx 6000$ chosen to optimize proximity with FID, with colors indicating the step at which the metric is evaluated. Right: MIND without the scaling factor for various embedding dimensions, with colors indicating the dimension in which the metric is evaluated.
### 4.3 MIND analysis

We illustrate the behavior of the MIND metric by analyzing its dependency on $n$ (the number of samples from the distributions) and $M$ (the number of random projections). Fixing a number $k>0$ (which varies across settings), we evaluate its ability to correctly order $k$ different distributions $p^{1},\ldots,p^{k}$ relative to the true data distribution. Specifically, we calculate the probability of error in correctly ranking the sequence of distances $\operatorname{MIND}(\hat{p}_{n}^{1},\hat{p}_{n,\text{data}}),\ldots,\operatorname{MIND}(\hat{p}_{n}^{k},\hat{p}_{n,\text{data}})$. This measures the ability of the metric to distinguish different images from the elements of the dataset. Our observations indicate that $n=5000$ samples is sufficient to reliably distinguish these distributions (we benchmark this against other metrics quantitatively in Section 4.4).

![Refer to caption](https://arxiv.org/html/2605.06797v1/x8.png)

Figure 6: (Top Left and Middle) Behavior of the MIND and FID metrics in $n$, to distinguish true images from the dataset (base, in blue) from generated images (model, in orange). (Bottom Left and Middle) Histogram of the trials for $n=5000$; a bigger gap is better. (Right) Probability of error defined in Section 4.4.1 for three values of $M\in\{10,100,1000\}$.

We also plot the dependency of the estimated metric on $M$, the number of uniformly chosen random projections. This dependency is easier to analyze, since the Monte Carlo estimate is obtained by averaging unbiased terms to compute the expected Wasserstein distance. We observe that choosing $M$ in the range $[100,1000]$ is sufficient (see Appendix C).

### 4.4 Metric comparison

![Refer to caption](https://arxiv.org/html/2605.06797v1/x9.png)

Figure 7: Sample complexity measured by the probability of error for the correct order at five different steps of training.

Running evaluations during the training of a diffusion model, we observe that instead of using $\text{FID}_{50k}$ (commonly used post-training because of the cost and time associated with the high sample size), we can use $\operatorname{MIND}_{5k}$ (we evaluate their precision more quantitatively in the rest of this section). As visible in Figure 5, these two metrics are highly correlated, especially later during training.

In order to compare different metrics in a principled fashion, we evaluate how useful they are for distinguishing distributions. This provides natural tasks for which these metrics can be evaluated as methods, through the lens of statistical hypothesis testing. This can also be understood as a comparison of the metrics in distribution (for random samples), rather than as single values. Given a sample size $n$ and a metric $\Delta$, we state our three statistical hypothesis tests in the following.

#### 4.4.1 Generated vs. true data

This is done by comparing two distributions $p_{\theta}$ (for some pre-trained model $g_{\theta}$, with parameters $\theta$) with $p_{\text{data}}$, and estimating $\Delta(\hat{p}_{n,\theta},\hat{p}_{n,\text{data}})$. Our diffusion model utilizes a U-Net backbone (Ronneberger et al., 2015; Nichol and Dhariwal, 2021) trained on ImageNet-64. Experiments relying on real data use a fixed set of 100,000 original ImageNet-64 images. Conversely, for evaluations involving data generated from models, we generate a dedicated fixed set of 50,000 samples from each model under evaluation.

We compare the values of the metric under two settings, where $\hat{p}_{n,\theta}$ is a distribution of $n$ samples generated from a trained model, and $\hat{p}_{n,\text{data}}$ and $\hat{p}^{\prime}_{n,\text{data}}$ are two independent samples of size $n$ from the data. The probability of error is defined as:

$$\mathbf{P}\big(\Delta(\hat{p}_{n,\text{data}},\hat{p}^{\prime}_{n,\text{data}})\geq\Delta(\hat{p}_{n,\text{data}},\hat{p}_{n,\theta})\big)\,.$$

This measures the ability of the metric to distinguish generated images from elements of the dataset. The results are given in Figure 6. We remark that MIND is able to separate the two distributions as soon as $n\geq 5000$ (Figure 6, left column), while FID requires more than 10k samples to do so (Figure 6, middle column).
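This error probability can be estimated by repeated subsampling. Below is a minimal sketch of such a Monte Carlo estimate (our own illustration, not the authors' evaluation harness): it draws disjoint data subsamples and a model subsample from pre-computed embedding pools and counts how often the metric fails to separate them. Any metric $\Delta$, such as the `mind` sketch from Section 4.1, can be passed as `metric`; the pool sizes and the number of trials are illustrative.

```python
import torch

def error_probability(data_pool, model_pool, metric, n=5000, trials=100, seed=0):
    """Fraction of trials where Delta(data, data') >= Delta(data, model).

    data_pool: (N, d) data embeddings with N >= 2n; model_pool: (N', d) model embeddings.
    """
    g = torch.Generator().manual_seed(seed)
    errors = 0
    for _ in range(trials):
        idx = torch.randperm(data_pool.shape[0], generator=g)
        p_data, p_data2 = data_pool[idx[:n]], data_pool[idx[n:2 * n]]
        p_model = model_pool[torch.randperm(model_pool.shape[0], generator=g)[:n]]
        errors += int(metric(p_data, p_data2) >= metric(p_data, p_model))
    return errors / trials
```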

#### 4.4.2 Monotonicity

We compare MIND with other metrics using a diffusion model $g_{\theta}$ trained on the ImageNet-64 dataset (Deng et al., 2009), from which we selected five models, $g_{\theta_{1}},\dots,g_{\theta_{5}}$, corresponding to five distinct training checkpoints, and generate 50k images with each of them. The probability of error in ranking these checkpoints correctly is:

$$1-\mathbf{P}\big(\Delta(\hat{p}_{n,\text{data}},\hat{p}_{n,\theta_{1}})\geq\ldots\geq\Delta(\hat{p}_{n,\text{data}},\hat{p}_{n,\theta_{k}})\big)\,.$$

For several sample sizes $n$ ranging from 10 to 10k, we perform 512 independent trials for each metric. We observe that MIND, MMD and sliced FID achieve similar performance in this test (Figure 7).
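One trial of this ranking test can be scored as in the small sketch below (names are ours), assuming earlier checkpoints should be farther from the data; averaging it over the 512 independent subsampling trials gives the error probability plotted in Figure 7.

```python
def ranking_error(data_sample, checkpoint_samples, metric):
    """1.0 if the distances to the data are not in nonincreasing order
    over the checkpoints (ordered from earliest to latest), else 0.0."""
    dists = [metric(data_sample, s) for s in checkpoint_samples]
    ordered = all(dists[i] >= dists[i + 1] for i in range(len(dists) - 1))
    return 0.0 if ordered else 1.0
```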

#### 4.4.3 Perturbations

We consider three different types of image perturbations, where the severity of the perturbation is given by a parameter $\varepsilon$. Each experiment is performed with 512 independent trials. We measure the performance of each metric using

$$1-\mathbf{P}\big(\Delta(\hat{p}_{n,\text{data}},\hat{p}_{n,\text{data},\varepsilon_{1}})\leq\ldots\leq\Delta(\hat{p}_{n,\text{data}},\hat{p}_{n,\text{data},\varepsilon_{k}})\big)\,,$$

which is the probability of failing to order all perturbation levels. The results are summarized in Figure 8. We highlight that, for $n\geq 5000$, MIND achieves the same level of performance as MMD, while FID is worse on all tasks.

##### Gaussian blur.

We select a perturbation level $\varepsilon\in\{0.2,0.4,0.6,0.8,1.0\}$. We apply a Gaussian filter with standard deviation $\varepsilon$ to each image in ImageNet-64.

##### Rectangle.

We select a perturbation level $\varepsilon\in\{0.05,0.1,0.15,0.2\}$. We randomly place 5 squares of size $10\times 10$ with opacity $\varepsilon$ in each image of ImageNet-64.

##### Mixture of datasets.

We select a perturbation level $\varepsilon\in\{1\%,3\%,5\%,7\%,10\%\}$. We draw samples with proportion $(1-\varepsilon)$ from ImageNet-64 and $\varepsilon$ from CelebA (Liu et al., 2015); see the sketch below.
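A minimal sketch of how such a mixed sample might be assembled from pre-computed embedding pools of the two datasets (the pool names and the helper are ours, for illustration):

```python
import torch

def mixture_sample(imagenet_emb, celeba_emb, n, eps):
    """Draw n embeddings: a fraction (1 - eps) from ImageNet-64, eps from CelebA."""
    n_celeba = int(round(eps * n))
    idx_a = torch.randperm(imagenet_emb.shape[0])[: n - n_celeba]
    idx_b = torch.randperm(celeba_emb.shape[0])[:n_celeba]
    return torch.cat([imagenet_emb[idx_a], celeba_emb[idx_b]], dim=0)
```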

![Refer to caption](https://arxiv.org/html/2605.06797v1/x10.png)

Figure 8: Sample complexity measured by the probability of detecting a small perturbation.

### 4.5 Robustness to metric hacking with moment matching

As mentioned above, one of the weaknesses of FID is that it is not a proper distance. Indeed, since this metric is only a function of the means and covariances of the considered distributions, if $p$ and $q$ share the same first and second moments, then $\text{FID}(p,q)=0$. This is not the case for proper metrics deriving from distances, such as MIND. We leverage this fact to create an artificial distribution of samples that have a desired mean and covariance. This construction is based on the following property, whose proof is in Appendix B.1.

###### Proposition 4.1.

Let $p$ be a target distribution over $\mathbb{R}^{d}$ with mean $\mu$ and covariance $\Sigma$, whose eigendecomposition is

$$\Sigma=USU^{\intercal}=\sum_{i=1}^{r}\lambda_{i}u_{i}u_{i}^{\intercal}\,.$$

Define the $2r$ vectors $v^{(+)}_{i}$ and $v^{(-)}_{i}$, indexed by $i\in[r]$, by

$$v^{(+)}_{i}=\mu+\alpha u_{i}\,,\quad v^{(-)}_{i}=\mu-\alpha u_{i}\,,$$

with $\alpha=\sqrt{\operatorname{Tr}(\Sigma)}$. Define $\pi^{(+)}_{i}=\pi^{(-)}_{i}=\lambda_{i}/(2\operatorname{Tr}(\Sigma))$ and note that the $\pi_{i}$ are nonnegative and sum to 1. Let $\hat{q}$ be the distribution of the $v_{i}$'s, each with probability $\pi_{i}$, given by

$$\hat{q}=\sum_{i=1}^{r}\pi_{i}^{(+)}\delta_{v_{i}^{(+)}}+\sum_{i=1}^{r}\pi_{i}^{(-)}\delta_{v_{i}^{(-)}}\,.$$

It holds that

$$\mathbb{E}_{\hat{q}}[v]=\mu\,,\quad\mathbb{E}_{\hat{q}}[(v-\mu)(v-\mu)^{\intercal}]=\Sigma\,,\quad\text{FID}(\hat{q},p)=0\,.$$
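The construction in Proposition 4.1 can be checked numerically in a few lines; the following sketch (our own variable names) builds the target vectors and weights and verifies that their weighted mean and covariance match $(\mu,\Sigma)$.

```python
import torch

def moment_matching_targets(mu, cov):
    """Build the 2r target vectors v and weights pi of Proposition 4.1."""
    lam, u = torch.linalg.eigh(cov)              # eigenvalues, eigenvectors (columns of u)
    alpha = torch.sqrt(torch.trace(cov))
    v = torch.cat([mu + alpha * u.T, mu - alpha * u.T], dim=0)   # shape (2r, d)
    pi = torch.cat([lam, lam]) / (2 * torch.trace(cov))
    return v, pi

# Sanity check on a random positive semi-definite covariance.
d = 8
mu = torch.randn(d)
a = torch.randn(d, d)
cov = a @ a.T / d
v, pi = moment_matching_targets(mu, cov)
mean_q = (pi[:, None] * v).sum(dim=0)
cov_q = ((v - mu).T * pi) @ (v - mu)
assert torch.allclose(mean_q, mu, atol=1e-4)
assert torch.allclose(cov_q, cov, atol=1e-4)
```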

### 4.6 Moment matching procedure

This proposition can be leveraged to perform metric hacking with moment matching: for a batch of $n$ images $a^{0}$ with embedding distribution $\hat{q}^{0}$, we construct $a=a^{0}+\varepsilon$ whose embeddings have distribution $\hat{q}$ such that the metric $\Delta(\hat{q},\hat{p}_{n,\text{data}})$ is much smaller than $\Delta(\hat{q}^{0},\hat{p}_{n,\text{data}})$ (for some data distribution $p_{\text{data}}$), by optimizing the moments of these embeddings, using the target vectors given by Proposition 4.1, with no visually discernible alteration (see Figure 11 in Appendix C.4).

We do so in the following manner: for $n=2r$ and a batch $a^{0}\in\mathbb{R}^{2r\times[\text{dims}]}$ of $2r$ images, each of shape $[\text{dims}]$ (e.g. $[512,512,3]$), and a target distribution $p_{\text{data}}$ over $\mathbb{R}^{d}$, we consider the following objective, aiming to give each $a_{i}$ an embedding close to the target $v_{i}$:

$$\min_{a\in\mathbb{R}^{2r\times[\text{dims}]}}\ell(a)=\min_{a\in\mathbb{R}^{2r\times[\text{dims}]}}\sum_{i=1}^{2r}\|\psi_{w}(a_{i})-v_{i}\|^{2}\,.$$

Table 1: Robustness of several metrics under moment matching.

We initialize $a$ at $a^{0}$, which consists of $2r$ copies of the same image. This optimization problem is highly parallelizable since the loss is fully separable over each of the $a_{i}$, and we can use stochastic optimization methods to solve it. If the batch $a$ satisfies $\ell(a)=0$, then the FID of the distribution of the $a_{i}$ with probabilities $\pi_{i}$ is also 0, and we show that optimizing this loss reduces the FID significantly.

In this experiment, we use a full-rank batch, $r=2048$, the dimension of the latent space. Therefore, the total batch size is $n=4096$ and we separate the optimization problem. In our evaluation we use $M=1000$ for MIND and 50k samples to compute the reference mean and covariance for the FID. The results summarized in Table 1 show that several of these metrics are highly sensitive to moment-matching hacking, with only about 10% or less of the metric value remaining for the baseline metrics (and much less for sample-efficient metric versions of the FID), and that, while affected, the MIND metric is much more robust.
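A rough sketch of the attack loop described above, with a placeholder differentiable feature extractor `psi` standing in for Inception-v3; the optimizer, learning rate, and number of steps here are illustrative choices, not the authors' settings.

```python
import torch

def moment_matching_attack(a0, targets, psi, steps=200, lr=0.01):
    """Perturb a batch of images so their embeddings approach the target vectors.

    a0:      (2r, *dims) initial images, e.g. 2r copies of the same image.
    targets: (2r, d) target vectors from Proposition 4.1.
    psi:     differentiable feature extractor mapping images to d-dim embeddings.
    """
    a = a0.clone().requires_grad_(True)
    opt = torch.optim.Adam([a], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((psi(a) - targets) ** 2).sum()   # fully separable over the batch
        loss.backward()
        opt.step()
    return a.detach()
```

Since the loss is separable over the $a_{i}$, the batch can also be split into independent chunks and optimized in parallel, as done for the full-rank batch of $n=4096$ images.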

### 4.7 Computation time comparison

We compare the running time of computing the different metrics on TPUv4, given two sets of embeddings with sample size $n$. We emphasize that this is only the time to compute the metric, not to generate the samples, which is roughly linear in $n$. In Figure 9(a), we observe that computing MIND at its recommended sample size of 5k is more than 2 orders of magnitude faster than computing FID. This is an additional difference, on top of the time necessary to produce a much larger sample when using FID.

### 4.8 Peak memory comparison

We compare the peak memory required to compute the different metrics on a TPUv4, using two sets of embeddings with sample size $n$. Note that these measurements reflect only the additional memory consumed during metric computation and do not include the memory occupied by the input data itself; the latter results are provided in Appendix C.2. In Figure 9(b), we highlight that computing MIND at its recommended sample size ($n=5$k) requires over an order of magnitude less memory than computing either MMD or FID. (Note that the curves for MMD and MIND collapse across different input dimensions, resulting in overlapping lines.)

![Refer to caption](https://arxiv.org/html/2605.06797v1/x11.png)
(a) Execution time
![Refer to caption](https://arxiv.org/html/2605.06797v1/x12.png)
(b) Peak Memory

Figure 9: Walltime and peak memory comparison for MIND, MMD, and FID.

## Conclusion

In this work, we introduced MIND, a metric for evaluating generative models that addresses statistical and computational limitations of FID. Our empirical results demonstrate that MIND is faster to compute and achieves stable evaluations with sample sizes as low as 2k, compared to the 50k typically required for FID. Furthermore, as a proper distance metric, MIND exhibits better robustness to moment-matching adversarial attacks than other metrics, while still being affected by them. As a purely statistical metric, MIND measures the distributional distance to a reference dataset (such as the training data). It is not designed to evaluate other qualitative aspects of the generated images, such as visual aesthetics or text legibility. We believe this metric provides a rigorous, efficient, and reliable standard for assessing the quality of modern generative models.

## References

- P. Billingsley (2017). Probability and Measure. John Wiley & Sons.
- M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton (2018). Demystifying MMD GANs. In The Sixth International Conference on Learning Representations.
- S. Bischoff, A. Darcher, M. Deistler, R. Gao, F. Gerken, M. Gloeckler, L. Haxel, J. Kapoor, J. K. Lappalainen, J. H. Macke, et al. (2024). A practical guide to sample-based statistical distances for evaluating generative models in science. arXiv:2403.12636.
- N. Bonnotte (2013). Unidimensional and evolution methods for optimal transportation. Ph.D. Thesis, Université Paris Sud-Paris XI; Scuola Normale Superiore (Pise, Italie).
- J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, et al. (2018). JAX: composable transformations of Python+NumPy programs.
- M. J. Chong and D. Forsyth (2020). Effectively unbiased FID and Inception Score and where to find them. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6070–6079.
- T. M. Cover (1999). Elements of Information Theory. John Wiley & Sons.
- M. Cuturi (2013). Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, Vol. 26.
- J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009). ImageNet: a large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), pp. 248–255.
- R. M. Dudley (1969). The speed of mean Glivenko-Cantelli convergence. Annals of Mathematical Statistics 40, pp. 40–50.
- A. Genevay, G. Peyré, and M. Cuturi (2018). Learning generative models with Sinkhorn divergences. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 84, pp. 1608–1617.
- A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012). A kernel two-sample test. Journal of Machine Learning Research 13(25), pp. 723–773.
- M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, Vol. 30.
- J. Ho, A. Jain, and P. Abbeel (2020). Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, Vol. 33, pp. 6840–6851.
- S. Jayasumana, S. Ramalingam, A. Veit, D. Glasner, A. Chakrabarti, and S. Kumar (2024). Rethinking FID: towards a better evaluation metric for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9307–9315.
- L. V. Kantorovich (1942). On the translocation of masses. Doklady Akademii Nauk SSSR 37(7-8), pp. 227–229.
- T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017). Progressive growing of GANs for improved quality, stability, and variation. arXiv:1710.10196.
- Z. Liu, P. Luo, X. Wang, and X. Tang (2015). Deep learning face attributes in the wild. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 3730–3738.
- G. Monge (1781). Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie Royale des Sciences, pp. 666–704.
- K. Nadjahi, A. Durmus, L. Chizat, S. Kolouri, S. Shahrampour, and U. Simsekli (2020). Statistical and topological properties of sliced probability divergences. In Advances in Neural Information Processing Systems, Vol. 33, pp. 20802–20812.
- K. Nadjahi (2021). Sliced-Wasserstein distance for large-scale machine learning: theory, methodology and extensions. Ph.D. Thesis, Institut Polytechnique de Paris.
- A. Q. Nichol and P. Dhariwal (2021). Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171.
- M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski (2024). DINOv2: learning robust visual features without supervision. Transactions on Machine Learning Research.
- A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, Vol. 32.
- G. Peyré and M. Cuturi (2019). Computational optimal transport: with applications to data science. Foundations and Trends in Machine Learning 11(5-6), pp. 355–607.
- J. Rabin, G. Peyré, J. Delon, and M. Bernot (2011). Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435–446.
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763.
- O. Ronneberger, P. Fischer, and T. Brox (2015). U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- M. S. M. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly (2018). Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, Vol. 31.
- T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016). Improved techniques for training GANs. In Advances in Neural Information Processing Systems, Vol. 29.
- O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. (2025). DINOv3. arXiv:2508.10104.
- G. Stein, J. Cresswell, R. Hosseinzadeh, Y. Sui, B. Ross, V. Villecroze, Z. Liu, A. L. Caterini, E. Taylor, and G. Loaiza-Ganem (2023). Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. In Advances in Neural Information Processing Systems, Vol. 36, pp. 3732–3784.
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016). Rethinking the Inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826.
- C. Villani (2008). Optimal Transport: Old and New. Vol. 338, Springer.
- J. Yang, Z. Geng, X. Ju, Y. Tian, and Y. Wang (2026). Representation Fréchet loss for visual generation. arXiv:2604.28190.

## Appendix A Definitions

### A.1 Empirical measures

###### Definition A.1.

For a sample $Y_{1},\ldots,Y_{n}$ of size $n$ from some data distribution $p_{\text{data}}$, we denote by $\hat{p}_{n,\text{data}}$ the empirical distribution of the $Y_{i}$'s, defined by

$$\hat{p}_{n,\text{data}}=\frac{1}{n}\sum_{j}\delta_{Y_{j}}\,.$$

Similarly, for $X_{1},\ldots,X_{n}$ from $p_{\theta}$ we denote by $\hat{p}_{n,\theta}$ the empirical distribution of the $X_{i}$'s,

$$\hat{p}_{n,\theta}=\frac{1}{n}\sum_{j}\delta_{X_{j}}\,.$$

### A.2 Optimal transport

###### Definition A.2.

The optimal transport problem for the Euclidean cost, also called the 2-Wasserstein distance, is defined for two probability distributions $p,q\in\mathcal{P}_{2}(\mathbb{R}^{d})$ with finite second moments as

$$W_{2}^{2}(p,q)=\min_{T:T_{\#}p=q}\mathbb{E}_{X\sim p}\big[\|X-T(X)\|^{2}\big]\qquad(1)$$
$$\hphantom{W_{2}^{2}(p,q)}=\min_{\pi\in\Pi(p,q)}\mathbb{E}_{\pi}\big[\|X-Y\|^{2}\big]\,.\qquad(2)$$

The first is the Monge formulation (Monge, 1781) and the second the Kantorovich formulation (Kantorovich, 1942), with the equivalence holding when $p,q$ are absolutely continuous, or discrete uniform samples of the same finite size.

Note that this distance can be approximated with sample access to $p$ and $q$ by plugging in directly the empirical measures $\hat{p}_{n}$ of the $X_{j}$ and $\hat{q}_{n}$ of the $Y_{j}$. However, this approach suffers from two issues. First, it suffers from a curse of dimensionality: the convergence of the estimate $W_{2}^{2}(\hat{p}_{n},\hat{q}_{n})$ to $W_{2}^{2}(p,q)$ is slow, in $n^{-1/d}$ (Dudley, 1969). Second, it is slow to compute in general, with a worst-case super-cubic cost of $n^{3}$ for the Hungarian algorithm, and $n^{2}\log(n)$ for methods based on the Sinkhorn algorithm. The latter is motivated by an entropic-regularized formulation (Cuturi, 2013):

$$W^{2}_{2,\varepsilon}=\min_{\pi\in\Pi(p,q)}\mathbb{E}_{\pi}\big[\|X-Y\|^{2}\big]-\varepsilon H(\pi)\,.$$

An interesting exception is the one-dimensional case: when $d=1$, the solution of (1) is given by

$$W_{2}^{2}(\hat{p}_{n},\hat{q}_{n})=\frac{1}{n}\sum_{j=1}^{n}|\text{sort}(x)_{j}-\text{sort}(y)_{j}|^{2}\,,$$

where $\text{sort}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is the function that maps a vector $x\in\mathbb{R}^{n}$ to its copy sorted in nondecreasing order, $\text{sort}(x)=(x_{\sigma(1)},\ldots,x_{\sigma(n)})^{\intercal}$ such that $x_{\sigma(1)}\leq\ldots\leq x_{\sigma(n)}$ (ties do not make this function ambiguous). It can therefore be computed in time of order $n\log n$. This can be leveraged for $d>1$ by considering the average Wasserstein distance over uniformly random unit directions. This is called the sliced Wasserstein distance.

###### Definition A.3.

The sliced Wasserstein distance (Rabin et al., 2011; Nadjahi, 2021) is defined as the average of the Wasserstein distances over 1D projections along $u\sim\mathcal{U}(S)$, a uniformly random unit direction:

$$SW_{2}^{2}(p,q)=\mathbb{E}_{u\sim\mathcal{U}(S)}\big[W_{2}^{2}(u^{\intercal}p,u^{\intercal}q)\big]\,,$$

where $u^{\intercal}p$ (resp. $u^{\intercal}q$) denotes the distribution of $u^{\intercal}X$ when $X\sim p$ (resp. $u^{\intercal}Y$ when $Y\sim q$).

The sliced Wasserstein distance is still a distance between distributions. It can also be easily estimated from samples, given empirical measures $\hat{p}_{n}$ and $\hat{q}_{n}$ and $M$ i.i.d. unit vectors $u_{1},\ldots,u_{M}$:

$$\widehat{SW}^{2}_{2,M}(\hat{p}_{n},\hat{q}_{n})=\frac{1}{M}\sum_{i=1}^{M}W_{2}^{2}(u_{i}^{\intercal}\hat{p}_{n},u_{i}^{\intercal}\hat{q}_{n})=\frac{1}{Mn}\sum_{i=1}^{M}\sum_{j=1}^{n}|\text{sort}(u_{i}^{\intercal}X)_{j}-\text{sort}(u_{i}^{\intercal}Y)_{j}|^{2}\,.$$

One of the advantages of this approach is the relaxed computational load: computing Wasserstein distances in 1D only requires sorting all the elements in the sample, which can be done in order $n\log n$ time and is highly parallelizable, allowing $M$ of these operations to be performed with little to no overhead, along with $M$ projections from dimension $d$ to 1, for $n$ points each time.

In particular, under mild assumptions, the sample complexity of estimating the sliced Wasserstein distance does not depend on the dimension of the problem (Nadjahi et al., 2020), in contrast to the standard Wasserstein distance, for which the sample complexity grows exponentially with the dimension.

We finally note that, as in Rabin et al. (2011), we consider the average of the squared 1D Wasserstein distances, and would do so for other $\ell_{p}$, $p\neq 2$ norm costs.

### A.3 Metric comparison

##### Remarks about FID

- \-The last formula is also found in the literature as the following, both are equal W22​\(𝒩​\(μX,ΣX\),𝒩​\(μY,ΣY\)\)=‖μX−μY‖2\\displaystyle W\_\{2\}^\{2\}\(\\mathcal\{N\}\(\\mu\_\{X\},\\Sigma\_\{X\}\),\\mathcal\{N\}\(\\mu\_\{Y\},\\Sigma\_\{Y\}\)\)=\\\|\\mu\_\{X\}\-\\mu\_\{Y\}\\\|^\{2\}\+tr​\(ΣX\+ΣY−2​\(ΣY1/2​ΣX​ΣY1/2\)1/2\)\.\\displaystyle\\quad\+\\text\{tr\}\(\\Sigma\_\{X\}\+\\Sigma\_\{Y\}\-2\(\\Sigma\_\{Y\}^\{1/2\}\\Sigma\_\{X\}\\Sigma\_\{Y\}^\{1/2\}\)^\{1/2\}\)\\,\.
- \-In practice, the expectations are obtained based on a finite sampleX1,…,XnX\_\{1\},\\ldots,X\_\{n\}from a generative model andY1,…,YnY\_\{1\},\\ldots,Y\_\{n\}from a dataset, and we actually compute the plug\-in estimate FID​\(p^n,θ,p^n,data\)\\displaystyle\\text\{FID\}\(\\hat\{p\}\_\{n,\\theta\},\\hat\{p\}\_\{n,\\text\{data\}\}\)=‖μ^X−μ^Y‖2\\displaystyle=\\\|\\hat\{\\mu\}\_\{X\}\-\\hat\{\\mu\}\_\{Y\}\\\|^\{2\}\+tr​\(Σ^X\+Σ^Y−2​\(Σ^Y​Σ^X\)1/2\)\.\\displaystyle\\quad\+\\text\{tr\}\(\\hat\{\\Sigma\}\_\{X\}\+\\hat\{\\Sigma\}\_\{Y\}\-2\(\\hat\{\\Sigma\}\_\{Y\}\\hat\{\\Sigma\}\_\{X\}\)^\{1/2\}\)\\,\.
- \-This distance is motivated by theWasserstein distanceW22W\_\{2\}^\{2\}\(see, e\.g\. Villani,[2008](https://arxiv.org/html/2605.06797#bib.bib20), and Appendix[A\.2](https://arxiv.org/html/2605.06797#A1.SS2)in this work\), obtained by solving an optimal transport problem, with a square Euclidean distance cost\. For FID, this distance is applied to two fitted Gaussian distributions rather than to the sample distributionsp^n,θ\\hat\{p\}\_\{n,\\theta\}andp^n,data\\hat\{p\}\_\{n,\\text\{data\}\}\.
- Using this method, rather than a sample-based estimate of $W_2^2(\hat{p}_{n,\theta},\hat{p}_{n,\text{data}})$, overcomes the two main obstacles to using the Wasserstein distance between two distributions based on sample access: *statistical* and *computational* complexity. Computing the FID only requires estimating the means and covariance matrices and performing a conceptually simple, closed-form computation.
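For reference, a minimal sketch of the plug-in FID computation from embedding features (assuming NumPy and SciPy; recall from the first remark that the trace term with $(\hat{\Sigma}_Y\hat{\Sigma}_X)^{1/2}$ equals the one with the symmetric square root):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_from_features(X, Y):
    """Plug-in FID between two feature samples X, Y of shape (n, d):
    closed-form squared 2-Wasserstein distance between the fitted Gaussians."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    cov_x = np.cov(X, rowvar=False)
    cov_y = np.cov(Y, rowvar=False)

    # Matrix square root of Sigma_Y @ Sigma_X; discard tiny imaginary parts
    # introduced by numerical error.
    cov_sqrt = sqrtm(cov_y @ cov_x)
    if np.iscomplexobj(cov_sqrt):
        cov_sqrt = cov_sqrt.real

    return float(np.sum((mu_x - mu_y) ** 2)
                 + np.trace(cov_x + cov_y - 2.0 * cov_sqrt))
```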

###### Definition A.4 (mean FID).

Let $X=\psi_w(g_\theta(Z))\sim p_\theta$ and $Y=\psi_w(D)\sim p_{\text{data}}$,

$$\mu_X=\mathbb{E}_{p_\theta}[X],\qquad \mu_Y=\mathbb{E}_{p_{\text{data}}}[Y].$$
The mean FID (that we denote by $\mu\text{FID}$) is defined as

$$\mu\text{FID}(p_\theta,p_{\text{data}})=\|\mu_X-\mu_Y\|^2.$$

##### Remarks

- Much like for FID, it is very easy to estimate the mean FID from a finite sample with the plug-in empirical measures: $\mu\text{FID}(\hat{p}_{n,\theta},\hat{p}_{n,\text{data}})=\|\hat{\mu}_X-\hat{\mu}_Y\|^2$ (a minimal sketch appears after this list).
- We show in Section [4.4](https://arxiv.org/html/2605.06797#S4.SS4) that the sample complexity of $\mu\text{FID}$ is much lower than that of FID; this probably stems from the fact that only a vector of size $d$ must be estimated, rather than a $d$-by-$d$ matrix.
- We show in Section [4.5](https://arxiv.org/html/2605.06797#S4.SS5) that it is even less robust than FID.
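The plug-in estimator of the mean FID is essentially a one-liner; a minimal sketch (the function name is ours):

```python
import numpy as np

def mean_fid(X, Y):
    """Plug-in mean FID: squared Euclidean distance between the empirical
    mean embeddings of two feature samples of shape (n, d) and (m, d)."""
    return float(np.sum((X.mean(axis=0) - Y.mean(axis=0)) ** 2))
```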

###### Definition A.5 (Sliced FID).

Let $X=\psi_w(g_\theta(Z))\sim p_\theta$ and $Y=\psi_w(D)\sim p_{\text{data}}$; the sliced FID (that we denote by $\sigma\text{FID}$) is defined as

$$\sigma\text{FID}(p_\theta,p_{\text{data}})=\mathbb{E}_{u\sim\mathcal{U}(S)}\!\left[\text{FID}\!\left(u^{\intercal}p_\theta,\,u^{\intercal}p_{\text{data}}\right)\right].$$
For finite samples $(X_j)_{j\in[n]}$, $(Y_j)_{j\in[n]}$ and directions $(u_i)_{i\in[M]}$, it can be estimated by

$$\sigma\text{FID}(\hat{p}_{n,\theta},\hat{p}_{n,\text{data}})=\frac{1}{M}\sum_{i=1}^{M}\text{FID}\!\left(u_i^{\intercal}\hat{p}_n,\,u_i^{\intercal}\hat{q}_n\right)=\frac{1}{M}\sum_{i=1}^{M}\left\{\left(u_i^{\intercal}\hat{\mu}_{n,X}-u_i^{\intercal}\hat{\mu}_{n,Y}\right)^2+\left(\hat{\sigma}_{n,u_i^{\intercal}X}-\hat{\sigma}_{n,u_i^{\intercal}Y}\right)^2\right\}.$$
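Since a 1-D FID only involves projected means and standard deviations, the estimator above reduces to a few vectorized operations; a minimal sketch (the direction sampling and defaults are our choices):

```python
import numpy as np

def sliced_fid(X, Y, M=1000, seed=0):
    """Sliced FID estimator: average of 1-D FIDs over M random directions.
    For each unit direction u, the 1-D FID of the projected samples is
    (mean difference)^2 + (standard-deviation difference)^2."""
    d = X.shape[1]
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((M, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)

    pX, pY = X @ U.T, Y @ U.T                      # projections, shape (n, M)
    mean_diff = pX.mean(axis=0) - pY.mean(axis=0)
    std_diff = pX.std(axis=0) - pY.std(axis=0)
    return float(np.mean(mean_diff ** 2 + std_diff ** 2))
```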

##### Remarks

- While very easy to estimate from samples, we also show that it suffers from the same robustness issues as the FID.

###### Definition A.6 (Sinkhorn divergence (Genevay et al., [2018](https://arxiv.org/html/2605.06797#bib.bib31))).

For two distributions, we denote by $W_\varepsilon(p,q)$ the value of the entropic-regularized optimal transport problem between $p$ and $q$ (see, e.g., Peyré and Cuturi, [2019](https://arxiv.org/html/2605.06797#bib.bib21), and Appendix [A.2](https://arxiv.org/html/2605.06797#A1.SS2) in this work). Let $X=\psi_w(g_\theta(Z))\sim p_\theta$ and $Y=\psi_w(D)\sim p_{\text{data}}$; the Sinkhorn Divergence Inception Distance (SDID) is defined by

$$\text{SDID}_\varepsilon(p_\theta,p_{\text{data}})=W_\varepsilon(p_\theta,p_{\text{data}})-\frac{1}{2}W_\varepsilon(p_\theta,p_\theta)-\frac{1}{2}W_\varepsilon(p_{\text{data}},p_{\text{data}}).$$

##### Remarks

- For finite samples, the empirical measures $\hat{p}_{n,\theta},\hat{p}_{n,\text{data}}$ can be split into $\hat{p}_{1,n,\theta},\hat{p}_{2,n,\theta},\hat{p}_{1,n,\text{data}},\hat{p}_{2,n,\text{data}}$, and the divergence can be estimated by
$$\text{SDID}_\varepsilon(\hat{p}_{n,\theta},\hat{p}_{n,\text{data}})=W_\varepsilon(\hat{p}_{1,n,\theta},\hat{p}_{1,n,\text{data}})-\frac{1}{2}W_\varepsilon(\hat{p}_{1,n,\theta},\hat{p}_{2,n,\theta})-\frac{1}{2}W_\varepsilon(\hat{p}_{1,n,\text{data}},\hat{p}_{2,n,\text{data}}).$$
SDID can be computed by solving each entropic-regularized optimal transport problem with a fast, GPU-friendly alternating projection method known as Sinkhorn's algorithm (Cuturi, [2013](https://arxiv.org/html/2605.06797#bib.bib32)); see the sketch after this list.
- In practice, to overcome a curse of dimensionality, we have found it better to estimate the correction terms from two independent samples $\hat{p}_{1,n},\hat{p}_{2,n}$. This concretely doubles the required sample size. We have found this metric to be much more robust than FID, and to require a smaller sample size, though of a similar order (ignoring this doubling); see Section [4.4](https://arxiv.org/html/2605.06797#S4.SS4).
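As an illustration, a minimal NumPy sketch of the split estimator above, with a bare-bones Sinkhorn loop on a squared-Euclidean cost (no log-domain stabilization, a fixed number of iterations, and the reported value is the transport cost $\langle P, C\rangle$ under the regularized plan; the function names, $\varepsilon$ and iteration defaults are our assumptions):

```python
import numpy as np

def sinkhorn_cost(X, Y, eps=1.0, n_iter=200):
    """Entropic-regularized OT cost <P, C> between the uniform empirical
    measures on X and Y, via Sinkhorn's alternating scaling iterations."""
    C = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # cost matrix
    K = np.exp(-C / eps)                                       # Gibbs kernel
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    u = np.ones_like(a)
    for _ in range(n_iter):                                    # alternating projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]                            # transport plan
    return float(np.sum(P * C))

def sdid(Xa, Xb, Ya, Yb, eps=1.0):
    """SDID with correction terms estimated from independent splits
    (Xa, Xb) of the model features and (Ya, Yb) of the data features."""
    return (sinkhorn_cost(Xa, Ya, eps)
            - 0.5 * sinkhorn_cost(Xa, Xb, eps)
            - 0.5 * sinkhorn_cost(Ya, Yb, eps))
```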

###### Definition A.7 (Maximum mean discrepancy, MMD (Gretton et al., [2012](https://arxiv.org/html/2605.06797#bib.bib18); Jayasumana et al., [2024](https://arxiv.org/html/2605.06797#bib.bib14))).

Let $X=\psi_w(g_\theta(Z))\sim p_\theta$ and $Y=\psi_w(D)\sim p_{\text{data}}$, and let the kernel function be $k_\sigma(x,y)=\exp(-\|x-y\|^2/\sigma)$ for $\sigma>0$; the MMD is defined as

$$\text{MMD}(p_\theta,p_{\text{data}})=\mathbb{E}_{p_\theta\otimes p_\theta}[k(x,x')]-2\,\mathbb{E}_{p_\theta\otimes p_{\text{data}}}[k(x,y)]+\mathbb{E}_{p_{\text{data}}\otimes p_{\text{data}}}[k(y,y')].$$
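A minimal plug-in sketch of this quantity from samples (a biased V-statistic over all pairs; the bandwidth default is arbitrary and, as noted in the remarks below, must be tuned). It also makes the computational drawback explicit, since the full kernel blocks are materialized:

```python
import numpy as np

def mmd_gaussian(X, Y, sigma=10.0):
    """Plug-in MMD estimate with Gaussian kernel k(x, y) = exp(-||x-y||^2 / sigma).
    Builds the full n x n kernel blocks, which dominates memory and compute."""
    def k(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / sigma)
    return float(k(X, X).mean() - 2.0 * k(X, Y).mean() + k(Y, Y).mean())
```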

##### Remarks

- Since it is defined as a two-sample mean, the MMD can also be estimated quickly from the empirical distributions $\hat{p}_{n,\theta}$ and $\hat{p}_{n,\text{data}}$.
- One of the drawbacks of this metric is the need to select a hyperparameter $\sigma>0$.
- Another drawback is the computational aspect, as an $n\times n$ kernel matrix must be computed.
- The Sinkhorn divergence and MMD are related: when $\varepsilon\to+\infty$, we have that $\text{SDID}_\varepsilon\to\frac{1}{2}\text{MMD}_{-\|\cdot\|^2}$, where the kernel function $k$ is given by the negative squared Euclidean distance (rather than a Gaussian kernel).
- The use of this metric, with another embedding network, is recommended in Jayasumana et al. ([2024](https://arxiv.org/html/2605.06797#bib.bib14)).

## Appendix B Proofs

### B.1 Proof of Proposition 4.1

###### Proof\.

Recall that the target distribution $p$ has mean $\mu$ and covariance $\Sigma$. We construct the discrete distribution $\hat{q}$ using $2r$ vectors $v_i^{(+)}=\mu+\alpha u_i$ and $v_i^{(-)}=\mu-\alpha u_i$, where each vector is assigned probability $\pi_i=\lambda_i/(2\operatorname{tr}(\Sigma))$, and $\alpha=\sqrt{\operatorname{tr}(\Sigma)}$.

##### 1. Mean of $\hat{q}$

By the definition of expectation for a discrete distribution:

$$\mathbb{E}_{\hat{q}}[v]=\sum_{i=1}^{r}\pi_i v_i^{(+)}+\sum_{i=1}^{r}\pi_i v_i^{(-)}=\sum_{i=1}^{r}\pi_i(\mu+\alpha u_i)+\sum_{i=1}^{r}\pi_i(\mu-\alpha u_i)=\sum_{i=1}^{r}\pi_i(2\mu)=\mu\sum_{i=1}^{r}2\pi_i.$$
Since $2\pi_i=\lambda_i/\operatorname{tr}(\Sigma)$ and $\sum_{i=1}^{r}\lambda_i=\operatorname{tr}(\Sigma)$, it follows that $\sum 2\pi_i=1$, hence $\mathbb{E}_{\hat{q}}[v]=\mu$.

##### 2. Covariance of $\hat{q}$

The covariance of $\hat{q}$ is given by $\mathbb{E}_{\hat{q}}[(v-\mu)(v-\mu)^{\intercal}]$:

$$\text{Cov}_{\hat{q}}(v)=\sum_{i=1}^{r}\pi_i\left(v_i^{(+)}-\mu\right)\left(v_i^{(+)}-\mu\right)^{\intercal}+\sum_{i=1}^{r}\pi_i\left(v_i^{(-)}-\mu\right)\left(v_i^{(-)}-\mu\right)^{\intercal}=\sum_{i=1}^{r}\pi_i(\alpha u_i)(\alpha u_i)^{\intercal}+\sum_{i=1}^{r}\pi_i(-\alpha u_i)(-\alpha u_i)^{\intercal}=\sum_{i=1}^{r}2\pi_i\alpha^2 u_i u_i^{\intercal}.$$
Substituting $\alpha^2=\operatorname{tr}(\Sigma)$ and $2\pi_i=\lambda_i/\operatorname{tr}(\Sigma)$:

$$\text{Cov}_{\hat{q}}(v)=\sum_{i=1}^{r}\left(\frac{\lambda_i}{\operatorname{tr}(\Sigma)}\right)\operatorname{tr}(\Sigma)\,u_i u_i^{\intercal}=\sum_{i=1}^{r}\lambda_i u_i u_i^{\intercal}=\Sigma.$$

##### 3. FID Value

The FID between two distributions is defined as the 2-Wasserstein distance between their associated Gaussians (Heusel et al., [2017](https://arxiv.org/html/2605.06797#bib.bib30)). Consequently, FID is strictly a function of the first two moments. Since $\mathbb{E}_{\hat{q}}[v]=\mu_p$ and $\text{Cov}_{\hat{q}}(v)=\Sigma_p$, the means and covariances match exactly:

$$\text{FID}(\hat{q},p)=\|\mu-\mu\|^2+\operatorname{tr}\!\left(\Sigma+\Sigma-2(\Sigma\Sigma)^{1/2}\right)=0+\operatorname{tr}(2\Sigma-2\Sigma)=0.$$
This concludes the proof. ∎
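For completeness, a small numerical check of this construction (a sketch with an arbitrary dimension and a random covariance; it only verifies that the first two moments of $\hat{q}$ match those of $p$, which is exactly what forces the FID to zero):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
mu = rng.standard_normal(d)
A = rng.standard_normal((d, d))
Sigma = A @ A.T                              # an arbitrary PSD covariance

lam, U = np.linalg.eigh(Sigma)               # eigenvalues lambda_i, eigenvectors u_i (columns)
alpha = np.sqrt(np.trace(Sigma))
pi = lam / (2.0 * np.trace(Sigma))           # weight of each of the 2r atoms

# 2r atoms mu +/- alpha * u_i, each with weight pi_i.
V = np.concatenate([mu + alpha * U.T, mu - alpha * U.T], axis=0)
w = np.concatenate([pi, pi])

mean_q = w @ V
cov_q = (V - mean_q).T @ ((V - mean_q) * w[:, None])

print(np.allclose(mean_q, mu))               # True: means match
print(np.allclose(cov_q, Sigma))             # True: covariances match, so FID(q_hat, p) = 0
```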

## Appendix C Additional results

![Refer to caption](https://arxiv.org/html/2605.06797v1/x13.png)(a) Variance of MIND with different numbers of projections $M$.
![Refer to caption](https://arxiv.org/html/2605.06797v1/x14.png)(b) Peak memory used for calculating different metrics.

### C.1 Effects of the number of projections

As shown in Figure [10(a)](https://arxiv.org/html/2605.06797#A3.F10.sf1), our empirical analysis shows that for $M>1000$ the variance is affected only at scales smaller than numerical artifacts. We also show that using $M=100$ yields almost the same performance, while there is a substantial degradation when using $M=10$.

### C.2 Peak memory

The measurements include both the memory occupied by the input data and the temporary memory required for metric computation. As shown in Figure [10(b)](https://arxiv.org/html/2605.06797#A3.F10.sf2), MIND at its recommended sample size ($n=5$k) requires less memory than MMD and FID.

### C.3 Computation resources and experimental details

We run each experiment in Section [4.4](https://arxiv.org/html/2605.06797#S4.SS4) for 2 hours using 4 TPUv5e, and each experiment in Section [4.5](https://arxiv.org/html/2605.06797#S4.SS5) for 10 minutes using 4 TPUv5e. The diffusion model used in Section [4.4](https://arxiv.org/html/2605.06797#S4.SS4) is trained for 5M steps on ImageNet-64; we summarize other details of its training and sampling in Table [2](https://arxiv.org/html/2605.06797#A3.T2).


Table 2: Hyperparameters for training and sampling from diffusion models.
### C.4 Moment-matching hacking

We illustrate the results of the experiment described in Section [4.5](https://arxiv.org/html/2605.06797#S4.SS5) in Figure [11](https://arxiv.org/html/2605.06797#A3.F11).

![Refer to caption](https://arxiv.org/html/2605.06797v1/figures/hacking.png)Figure 11: Two elements of the batch; all initial images are the same. (Left) Initial image. (Center) Image after optimization. (Right) Difference, scaled by a factor of 100 to make it visible.
