The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

Hugging Face Daily Papers 06/18/26, 12:00 AM Papers

fid generative-models evaluation reproducibility image-generation randomness statistical-analysis

Summary

This paper analyzes the variance of FID scores across different training and sampling seeds, revealing significant reproducibility issues in image generation evaluation. It proposes a new evaluation protocol with error bars and per-cell optimal guidance tuning.

The Frechet Inception Distance (FID) is the de facto arbiter of image generation, yet most papers report just a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report surprising findings: (a) Retraining the model using the same recipe with a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread, holding the FID coefficient of variation (CoV) inside a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.
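The recommended protocol can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' code: it assumes you already have one FID value per training seed for each of two methods, and it treats a relative gap between the mean FIDs that falls below the ~1.3% CoV quoted in the abstract as inconclusive. The function names, the choice to normalise the gap by the smaller mean, and the example FID values are all hypothetical.

```python
from statistics import mean, stdev


def fid_summary(fids):
    """Summarise FID over several training seeds (needs at least 2 values)."""
    m = mean(fids)
    s = stdev(fids)  # sample standard deviation (n - 1 denominator)
    return {"mean": m, "std": s, "cov": s / m}


def compare_runs(fids_a, fids_b, cov_floor=0.013):
    """Compare two methods by mean FID over training seeds.

    cov_floor defaults to the ~1.3% CoV quoted in the abstract. Normalising
    the gap by the smaller mean is an assumption made for this sketch.
    """
    a = fid_summary(fids_a)
    b = fid_summary(fids_b)
    rel_gap = abs(a["mean"] - b["mean"]) / min(a["mean"], b["mean"])
    verdict = "inconclusive" if rel_gap < cov_floor else "distinguishable"
    return rel_gap, verdict


if __name__ == "__main__":
    # Hypothetical FID values, one per training seed; not taken from the paper.
    method_a = [2.10, 2.14, 2.08, 2.12]
    method_b = [2.11, 2.13, 2.09, 2.15]
    for name, fids in (("A", method_a), ("B", method_b)):
        s = fid_summary(fids)
        # Report an error bar instead of a single number.
        print(f"method {name}: FID = {s['mean']:.3f} +/- {s['std']:.3f} "
              f"(CoV {100 * s['cov']:.2f}%)")
    gap, verdict = compare_runs(method_a, method_b)
    print(f"relative gap {100 * gap:.2f}% -> {verdict}")
```

Reporting the mean with a standard deviation over training seeds, rather than the best single run, is the part of the protocol this sketch is meant to make concrete. The per-cell guidance tuning step is not shown.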

Source: https://huggingface.co/papers/2606.20536

Abstract

Analysis of FID variance across different training and sampling seeds reveals significant reproducibility issues in image generation evaluation, with retraining causing larger fluctuations than resampling, and recommends updated evaluation protocols with error bars and optimal guidance tuning.
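To make the retraining-versus-resampling comparison concrete, here is a minimal sketch of a variance split on a two-axis panel, where each row is a training seed and each column is a generation seed. It is an assumption-laden illustration: the paper reports its 3.2x ratio in Inception feature space, whereas this sketch works on scalar FID values, and the estimator used (a one-way random-effects split) is a standard textbook choice, not necessarily the authors'. The panel values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def decompose_panel(panel):
    """Split FID variance into training-seed and sampling-seed parts.

    panel[i][j] is the FID of training seed i evaluated with generation
    seed j. Every row must have the same length (at least 2), and there
    must be at least 2 rows.
    """
    n_gen = len(panel[0])
    # Resampling spread: average within-row sample variance.
    var_sampling = mean([variance(row) for row in panel])
    # Row means still carry sampling noise of size var_sampling / n_gen,
    # so subtract it to estimate the pure training-seed variance.
    var_row_means = variance([mean(row) for row in panel])
    var_training = max(var_row_means - var_sampling / n_gen, 0.0)
    return var_training, var_sampling


def spread_ratio(panel):
    """Ratio of training-seed to sampling-seed standard deviation."""
    var_training, var_sampling = decompose_panel(panel)
    return sqrt(var_training / var_sampling)


if __name__ == "__main__":
    # Hypothetical 3 training seeds x 3 generation seeds; not paper data.
    panel = [
        [2.10, 2.11, 2.09],
        [2.16, 2.17, 2.15],
        [2.05, 2.04, 2.06],
    ]
    print(f"training/sampling spread ratio: {spread_ratio(panel):.2f}")
```

A ratio well above 1 would indicate that retraining moves the score more than redrawing samples from a fixed network, which is the qualitative pattern the abstract describes.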

Similar Articles

Representation Fréchet Loss for Visual Generation

MIND: Monge Inception Distance for Generative Models Evaluation

Improved Techniques for Training Consistency Models

On the quantitative analysis of decoder-based generative models

Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts
