The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

Hugging Face Daily Papers

Summary

This paper analyzes the variance of FID scores across different training and sampling seeds, revealing significant reproducibility issues in image generation evaluation. It proposes a new evaluation protocol with error bars and per-cell optimal guidance tuning.

The Frechet Inception Distance (FID) is the de facto arbiter of image generation, yet most papers report just a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report surprising findings: (a) Retraining the model using the same recipe with a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread, holding the FID coefficient of variation (CoV) inside a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.
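The reporting side of the recommended protocol can be illustrated with a minimal sketch. It is not the authors' code: the function names, the default threshold of 0.013 (taken from the ~1.3% CoV quoted above), and the choice to measure the gap relative to the mean of the two scores are all assumptions made for illustration.

```python
import statistics


def fid_summary(fids):
    """Return (mean, sample std, coefficient of variation) for FID scores
    obtained from several training seeds of the same recipe."""
    mean = statistics.mean(fids)
    std = statistics.stdev(fids)  # sample standard deviation (n - 1)
    return mean, std, std / mean


def gap_is_inconclusive(fid_a, fid_b, cov_threshold=0.013):
    """Return True when the relative gap between two FID scores is below
    the threshold. Assumption: the gap is taken relative to the mean of
    the two scores; the abstract does not specify the normalisation."""
    relative_gap = abs(fid_a - fid_b) / ((fid_a + fid_b) / 2)
    return relative_gap < cov_threshold


# Hypothetical FID values from three training seeds (not from the paper).
mean, std, cov = fid_summary([2.00, 2.02, 1.98])
print(f"FID = {mean:.2f} +/- {std:.2f} (CoV {cov:.1%})")
```

Under these assumptions, a comparison such as `gap_is_inconclusive(2.00, 2.02)` would be treated as a tie, while a gap of several percent would not.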

Cached at: 06/20/26, 02:27 PM

Source: https://huggingface.co/papers/2606.20536

Abstract

Analysis of FID variance across different training and sampling seeds reveals significant reproducibility issues in image generation evaluation, with retraining causing larger fluctuations than resampling, and recommends updated evaluation protocols with error bars and optimal guidance tuning.

The Frechet Inception Distance (FID) is the de facto arbiter of image generation, yet most papers report just a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report surprising findings: (a) Retraining the model using the same recipe with a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread, holding the FID coefficient of variation (CoV) inside a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.
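The two-axis panel described in the abstract can be pictured as a grid of FID scores, one row per training seed and one column per generation seed. The sketch below compares the spread along the two axes on scalar FID values. It is a simplification under stated assumptions: the paper reports its 3.2x figure in Inception feature space, which this example does not reproduce, and the function name and panel values are hypothetical.

```python
import math
import statistics


def seed_spread(panel):
    """panel[i][j] is the FID for training seed i and generation seed j.

    Returns (train_std, gen_std, ratio), where train_std is the sample
    std of per-network mean FID (retraining axis) and gen_std is the
    pooled within-network std across generation seeds (resampling axis).
    Note: train_std still contains a small share of resampling noise,
    because each row mean averages only a finite number of generation seeds.
    """
    row_means = [statistics.mean(row) for row in panel]
    train_std = statistics.stdev(row_means)
    pooled_var = statistics.mean([statistics.variance(row) for row in panel])
    gen_std = math.sqrt(pooled_var)
    return train_std, gen_std, train_std / gen_std


# Hypothetical 2x2 panel: two training seeds, two generation seeds each.
panel = [
    [2.0, 2.2],
    [3.0, 3.2],
]
train_std, gen_std, ratio = seed_spread(panel)
print(f"retraining std {train_std:.3f}, resampling std {gen_std:.3f}, ratio {ratio:.1f}")
```

A ratio well above 1 on such a panel would indicate that retraining moves FID more than resampling does, which is the qualitative pattern the abstract describes.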

Similar Articles

Representation Fréchet Loss for Visual Generation

Papers with Code Trending

This paper introduces FD-loss, a method to optimize Fréchet Distance as a training objective for visual generation by decoupling population and batch sizes. It demonstrates that this approach improves generator quality and suggests FID may not always accurately reflect visual quality.

Improved Techniques for Training Consistency Models

OpenAI Blog

OpenAI presents improved techniques for training consistency models that enable high-quality single-step image generation without distillation, achieving significant FID improvements on CIFAR-10 and ImageNet 64×64 through novel loss functions and training strategies.

On the quantitative analysis of decoder-based generative models

OpenAI Blog

This paper proposes using Annealed Importance Sampling to evaluate log-likelihoods for decoder-based generative models (VAEs, GANs, etc.), addressing the challenge of intractable likelihood estimation. The authors validate their method and provide evaluation code to analyze model performance, overfitting, and mode coverage.