The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation
Summary
This paper analyzes the variance of FID scores across different training and sampling seeds, revealing significant reproducibility issues in image generation evaluation. It proposes a new evaluation protocol with error bars and per-cell optimal guidance tuning.
Source: https://huggingface.co/papers/2606.20536
Abstract
Analysis of FID variance across different training and sampling seeds reveals significant reproducibility issues in image generation evaluation, with retraining causing larger fluctuations than resampling, and recommends updated evaluation protocols with error bars and optimal guidance tuning.
The Fréchet Inception Distance (FID) is the de facto arbiter of image generation, yet most papers report just a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report surprising findings: (a) Retraining the model using the same recipe with a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread, holding the FID coefficient of variation (CoV) inside a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.
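The reporting side of the recommended protocol can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes a hypothetical layout in which panel[i][j] holds the FID for training seed i and generation seed j, and the function names (summarize_panel, gap_is_inconclusive) are invented for illustration. It also measures spread directly in FID units, whereas the abstract's 3.2x figure is stated in Inception feature space, so the two are not directly comparable. The panel values are made-up placeholders, not results from the paper.

```python
from statistics import mean, stdev


def summarize_panel(panel):
    """Summarize a training-seed x generation-seed grid of FID values.

    panel[i][j] is assumed to be the FID for training seed i and
    generation seed j (a hypothetical layout, not the paper's format).
    """
    # Average over generation seeds first, so each retrained model
    # contributes one number to the across-training-seed spread.
    per_train_mean = [mean(row) for row in panel]
    grand_mean = mean(per_train_mean)
    train_sd = stdev(per_train_mean)
    # Typical spread from resampling a fixed network.
    sample_sd = mean(stdev(row) for row in panel)
    return {
        "mean": grand_mean,
        "train_sd": train_sd,      # error bar over training seeds
        "sample_sd": sample_sd,    # spread over generation seeds
        "cov": train_sd / grand_mean,
    }


def gap_is_inconclusive(fid_a, fid_b, cov_threshold=0.013):
    """Return True when the relative FID gap is below the CoV threshold.

    The default 0.013 mirrors the ~1.3% CoV quoted in the abstract;
    normalizing by the midpoint of the two scores is an assumption.
    """
    rel_gap = abs(fid_a - fid_b) / ((fid_a + fid_b) / 2)
    return rel_gap < cov_threshold


# Placeholder numbers for illustration only.
example_panel = [
    [10.0, 10.2],
    [11.0, 11.2],
    [9.0, 9.2],
]
stats = summarize_panel(example_panel)
print(f"FID = {stats['mean']:.2f} +/- {stats['train_sd']:.2f} "
      f"(CoV {100 * stats['cov']:.1f}%)")
print("10.0 vs 10.1 inconclusive:", gap_is_inconclusive(10.0, 10.1))
```

Averaging over generation seeds before taking the spread across training seeds keeps the two sources of randomness separate, which matches the abstract's distinction between retraining and resampling.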
Similar Articles
Representation Fréchet Loss for Visual Generation
This paper introduces FD-loss, a method to optimize Fréchet Distance as a training objective for visual generation by decoupling population and batch sizes. It demonstrates that this approach improves generator quality and suggests FID may not always accurately reflect visual quality.
MIND: Monge Inception Distance for Generative Models Evaluation
This paper introduces MIND (Monge Inception Distance), a new metric for evaluating generative models that is more sample-efficient, faster, and robust than the standard Fréchet Inception Distance (FID).
Improved Techniques for Training Consistency Models
OpenAI presents improved techniques for training consistency models that enable high-quality single-step image generation without distillation, achieving significant FID improvements on CIFAR-10 and ImageNet 64×64 through novel loss functions and training strategies.
On the quantitative analysis of decoder-based generative models
This paper proposes using Annealed Importance Sampling to evaluate log-likelihoods for decoder-based generative models (VAEs, GANs, etc.), addressing the challenge of intractable likelihood estimation. The authors validate their method and provide evaluation code to analyze model performance, overfitting, and mode coverage.
Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts
A local distribution-aware detection framework that amplifies micro-scale statistical irregularities to identify AI-generated images with improved accuracy, outperforming baseline detectors across benchmarks.