The FID Lottery: Quantifying Hidden Randomness in Generative Model Evaluation

Abstract

The Fréchet Inception Distance (FID) is the de facto arbiter of image generation, yet most papers report a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report four surprising findings: (a) Retraining the model with the same recipe but a different seed shifts FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread: the FID coefficient of variation (CoV) stays within a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.
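
As a rough illustration of the two-axis panel and the recommended protocol, the sketch below computes the spread across training seeds, the spread across generation seeds, and a simple "inconclusive gap" check. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`panel_spread`, `coefficient_of_variation`, `gap_is_conclusive`) are hypothetical, the FID values are made-up toy numbers rather than measured results, and the spread is computed on scalar FID values rather than in Inception feature space as the abstract describes.

```python
from statistics import mean, stdev

# Toy panel with made-up values (NOT results from the paper):
# panel[i][j] = FID for training seed i evaluated with generation seed j.
panel = [
    [2.10, 2.12, 2.11],
    [2.16, 2.18, 2.17],
    [2.04, 2.06, 2.05],
]


def panel_spread(panel):
    """Return (train_std, gen_std) for a training-seed x generation-seed panel.

    train_std: sample std of the per-training-seed mean FID (retraining axis).
    gen_std:   average within-row sample std (resampling a fixed network).
    This is a naive estimate; the row means still carry some generation noise.
    """
    row_means = [mean(row) for row in panel]
    train_std = stdev(row_means)
    gen_std = mean([stdev(row) for row in panel])
    return train_std, gen_std


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def gap_is_conclusive(fid_a, fid_b, cov_floor=0.013):
    """Treat a relative FID gap below the CoV floor as inconclusive.

    cov_floor defaults to the ~1.3% CoV quoted in the abstract; how exactly
    the gap is normalised is an assumption of this sketch.
    """
    rel_gap = abs(fid_a - fid_b) / ((fid_a + fid_b) / 2)
    return rel_gap >= cov_floor


train_std, gen_std = panel_spread(panel)
per_seed_fid = [mean(row) for row in panel]

print(f"std across training seeds:   {train_std:.4f}")
print(f"std across generation seeds: {gen_std:.4f}")
print(f"CoV across training seeds:   {coefficient_of_variation(per_seed_fid):.2%}")
# Report an error bar over training seeds instead of a single number.
print(f"FID = {mean(per_seed_fid):.2f} +/- {train_std:.2f}")
```

With a check like `gap_is_conclusive(2.11, 2.12)`, two methods whose FIDs differ by less than the CoV floor would be reported as a tie rather than as an improvement.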