FID Lottery: Quantifying Hidden Randomness in Generative Model Evaluation

Abstract

The Fréchet Inception Distance (FID) is the de facto arbiter of image generation quality, yet most papers report a single number obtained from a single trained model with a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training seeds and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report surprising findings: (a) Retraining the model with the same recipe but a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread; the FID coefficient of variation (CoV) stays within a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds perform best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.
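
The two-axis view and the CoV threshold in the recommended protocol can be illustrated with a minimal sketch. It assumes a panel of FID values indexed by training seed (rows) and generation seed (columns). The function names `panel_summary` and `gap_is_inconclusive` are hypothetical, the panel values are made-up placeholders rather than measurements, and the spreads are computed in FID units, so this does not reproduce the feature-space comparison behind the 3.2x figure.

```python
from statistics import mean, stdev


def panel_summary(panel):
    """Summarise a seed panel where panel[i][j] is the FID obtained with
    training seed i and generation seed j (hypothetical layout)."""
    row_means = [mean(row) for row in panel]
    grand_mean = mean(row_means)
    # Spread across training seeds: standard deviation of per-network means.
    # Simplification: this still contains a small share of generation noise.
    train_sd = stdev(row_means)
    # Spread across generation seeds: average within-network standard deviation.
    gen_sd = mean([stdev(row) for row in panel])
    return {
        "mean": grand_mean,
        "train_sd": train_sd,
        "gen_sd": gen_sd,
        "cov": train_sd / grand_mean,  # coefficient of variation over training seeds
    }


def gap_is_inconclusive(fid_a, fid_b, cov):
    """Return True when the relative gap between two FID values is below cov."""
    relative_gap = abs(fid_a - fid_b) / mean([fid_a, fid_b])
    return relative_gap < cov


# Placeholder panel: 3 training seeds x 3 generation seeds (not real results).
toy_panel = [
    [2.00, 2.02, 1.98],
    [2.10, 2.12, 2.08],
    [1.90, 1.92, 1.88],
]
summary = panel_summary(toy_panel)
print(summary)
print(gap_is_inconclusive(2.00, 2.01, cov=0.013))
```

Under this reading, a reported improvement would count as meaningful only when `gap_is_inconclusive` returns `False` for the chosen CoV, and `train_sd` is the quantity an error bar over training seeds would display.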