Rethinking FID: Towards a Better Evaluation Metric for Image Generation
November 30, 2023
Authors: Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, Sanjiv Kumar
cs.AI
Abstract
As with many machine learning problems, the progress of image generation
methods hinges on good evaluation metrics. One of the most popular is the
Fréchet Inception Distance (FID). FID estimates the distance between a
distribution of Inception-v3 features of real images, and those of images
generated by the algorithm. We highlight important drawbacks of FID:
Inception's poor representation of the rich and varied content generated by
modern text-to-image models, incorrect normality assumptions, and poor sample
complexity. We call for a reevaluation of FID's use as the primary quality
metric for generated images. We empirically demonstrate that FID contradicts
human raters, does not reflect the gradual improvement of iterative
text-to-image models, does not capture distortion levels, and produces
inconsistent results when the sample size is varied. We also propose CMMD, an
alternative metric based on richer CLIP embeddings and the maximum
mean discrepancy distance with the Gaussian RBF kernel. It is an unbiased
estimator that does not make any assumptions on the probability distribution of
the embeddings and is sample efficient. Through extensive experiments and
analysis, we demonstrate that FID-based evaluations of text-to-image models may
be unreliable, and that CMMD offers a more robust and reliable assessment of
image quality.
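To make the abstract's description concrete, the following is a minimal sketch of an unbiased squared-MMD estimator with a Gaussian RBF kernel, of the kind CMMD is built on. It assumes image embeddings (e.g., CLIP features) have already been extracted; the function names and the bandwidth default are illustrative choices, not the paper's implementation.

```python
import numpy as np

def rbf_kernel(a, b, sigma):
    # Pairwise Gaussian RBF kernel: k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2_unbiased(x, y, sigma=10.0):
    """Unbiased estimate of the squared MMD between two embedding samples.

    x: (n, d) array of embeddings of real images, y: (m, d) array for
    generated images (feature extraction is not shown here). sigma is a
    bandwidth hyperparameter; 10.0 is an illustrative default, not the
    paper's setting.
    """
    n, m = len(x), len(y)
    kxx = rbf_kernel(x, x, sigma)
    kyy = rbf_kernel(y, y, sigma)
    kxy = rbf_kernel(x, y, sigma)
    # Dropping the diagonal terms makes the within-sample averages, and
    # hence the whole estimator, unbiased.
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * kxy.mean()
```

Unlike FID, this estimator makes no Gaussian assumption about the embedding distributions: the two samples are compared directly through kernel averages, which is what allows the metric to remain well-behaved at small sample sizes.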