The FID Lottery: Quantifying Hidden Randomness in Generative Model Evaluation

Abstract

The Fréchet Inception Distance (FID) is the de facto arbiter of image generation quality, yet most papers report a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training seeds and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report four findings: (a) Retraining the model with the same recipe but a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread, and the FID coefficient of variation (CoV) stays inside a 1-2% band. (d) Tuning classifier-free guidance per cell halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.
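To make the recommended protocol concrete, the following is a minimal sketch of how the CoV and the "inconclusive gap" rule might be applied. It is not the authors' code. The function names, the FID values, and the choice to measure the gap relative to the smaller mean are all assumptions made for illustration; only the ~1.3% threshold comes from the abstract.

```python
from statistics import mean, stdev


def coefficient_of_variation(fids):
    """Sample CoV (standard deviation / mean) of FID across training seeds."""
    return stdev(fids) / mean(fids)


def compare_fid(fids_a, fids_b, cov_floor=0.013):
    """Compare two recipes given one FID per training seed for each.

    cov_floor is the ~1.3% CoV quoted in the abstract. Treating the
    relative gap between the two means as the quantity to test against
    it is an assumption of this sketch.
    """
    mean_a, mean_b = mean(fids_a), mean(fids_b)
    rel_gap = abs(mean_a - mean_b) / min(mean_a, mean_b)
    verdict = "inconclusive" if rel_gap < cov_floor else "distinguishable"
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "cov_a": coefficient_of_variation(fids_a),
        "cov_b": coefficient_of_variation(fids_b),
        "rel_gap": rel_gap,
        "verdict": verdict,
    }


if __name__ == "__main__":
    # Hypothetical per-seed FID values, not results from the paper.
    recipe_a = [2.10, 2.13, 2.08, 2.12]
    recipe_b = [2.09, 2.11, 2.14, 2.10]
    result = compare_fid(recipe_a, recipe_b)
    print(f"A: {result['mean_a']:.3f} (CoV {result['cov_a']:.2%})")
    print(f"B: {result['mean_b']:.3f} (CoV {result['cov_b']:.2%})")
    print(f"relative gap {result['rel_gap']:.2%}: {result['verdict']}")
```

Reporting the per-seed mean together with its CoV, rather than a single number, is what allows a reader to judge whether a claimed improvement exceeds seed-to-seed noise.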