Rethinking FID: Towards a Better Evaluation Metric for Image Generation

Abstract

As with many machine learning problems, the progress of image generation methods hinges on good evaluation metrics. One of the most popular is the Fréchet Inception Distance (FID). FID estimates the distance between the distribution of Inception-v3 features of real images and that of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters, does not reflect the gradual improvement of iterative text-to-image models, does not capture distortion levels, and produces inconsistent results when the sample size is varied. We also propose a new alternative metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy (MMD) distance with the Gaussian RBF kernel. It is an unbiased estimator that makes no assumptions about the probability distribution of the embeddings, and it is sample efficient. Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality.
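To make the MMD idea concrete, the following is a minimal sketch of the standard unbiased estimator of squared MMD with a Gaussian RBF kernel, written with NumPy. It is not the authors' implementation. The function names (`rbf_kernel`, `mmd2_unbiased`), the bandwidth `sigma=4.0`, and the random 16-dimensional vectors standing in for CLIP embeddings are all assumptions chosen for illustration; a real evaluation would use embeddings of real and generated images produced by a CLIP image encoder.

```python
import numpy as np


def rbf_kernel(a, b, sigma):
    """Gaussian RBF kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * sigma**2))


def mmd2_unbiased(x, y, sigma):
    """Unbiased estimate of squared MMD between samples x (m, d) and y (n, d)."""
    m, n = len(x), len(y)
    k_xx = rbf_kernel(x, x, sigma)
    k_yy = rbf_kernel(y, y, sigma)
    k_xy = rbf_kernel(x, y, sigma)
    # Exclude the diagonal (i == j) terms from the within-sample averages.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    term_xy = k_xy.mean()
    return term_xx + term_yy - 2.0 * term_xy


# Hypothetical stand-ins for embeddings (assumption: not real CLIP features).
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 16))
same_dist = rng.normal(size=(200, 16))
shifted = rng.normal(loc=1.0, size=(200, 16))

mmd_same = mmd2_unbiased(reference, same_dist, sigma=4.0)
mmd_shifted = mmd2_unbiased(reference, shifted, sigma=4.0)
print("same distribution:   ", mmd_same)
print("shifted distribution:", mmd_shifted)
```

Because the estimator only averages kernel evaluations between samples, it never fits a Gaussian to the embeddings, which is the property the abstract contrasts with FID's normality assumption. The estimate can be slightly negative when the two samples come from the same distribution, which is expected for an unbiased estimator.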
