Rethinking FID: Towards a Better Evaluation Metric for Image Generation

Abstract

As with many machine learning problems, progress in image generation hinges on good evaluation metrics. One of the most popular is the Fréchet Inception Distance (FID). FID estimates the distance between the distribution of Inception-v3 features of real images and that of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters, does not reflect the gradual improvement of iterative text-to-image models, does not capture distortion levels, and produces inconsistent results when the sample size is varied. We also propose a new alternative metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy (MMD) distance with the Gaussian RBF kernel. It is an unbiased estimator that makes no assumptions about the probability distribution of the embeddings and is sample efficient. Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality.
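
To make the two distances named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes embeddings are already available as arrays of shape (num_samples, dim); random vectors stand in for Inception-v3 or CLIP features here. The function names, the default bandwidth `sigma`, and the toy data are all hypothetical choices made for illustration. `frechet_distance` fits a Gaussian to each set (the normality assumption the abstract criticizes), while `mmd2_unbiased` is the standard unbiased estimator of squared MMD with a Gaussian RBF kernel and makes no distributional assumption.

```python
import numpy as np


def rbf_kernel(a, b, sigma):
    """Gaussian RBF kernel matrix: k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(a * a, axis=1)[:, None]
        + np.sum(b * b, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma**2))


def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased estimate of squared MMD between samples x and y.

    sigma is a hypothetical bandwidth; it must be tuned to the embedding scale.
    The estimate can be slightly negative when the two distributions match.
    """
    m, n = len(x), len(y)
    k_xx = rbf_kernel(x, x, sigma)
    k_yy = rbf_kernel(y, y, sigma)
    k_xy = rbf_kernel(x, y, sigma)
    # Exclude the i == j terms within each sample set; this removes the bias.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return float(term_xx + term_yy - 2.0 * k_xy.mean())


def frechet_distance(x, y):
    """Squared Frechet distance between Gaussians fitted to x and y.

    d^2 = ||mu_x - mu_y||^2 + Tr(C_x + C_y - 2 (C_x C_y)^(1/2))
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Tr((C_x C_y)^(1/2)) equals the sum of square roots of the eigenvalues
    # of C_x C_y, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_x - mu_y) ** 2)
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)


# Toy stand-ins for embeddings (assumption: real features would come from a model).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
same = rng.normal(size=(500, 8))                # drawn from the same distribution
shifted = rng.normal(loc=0.5, size=(500, 8))    # mean-shifted distribution

print("MMD^2  same:", mmd2_unbiased(real, same, sigma=2.0))
print("MMD^2  shifted:", mmd2_unbiased(real, shifted, sigma=2.0))
print("Frechet same:", frechet_distance(real, same))
print("Frechet shifted:", frechet_distance(real, shifted))
```

In this toy setting both distances grow when the second set is shifted. The difference the abstract emphasizes lies elsewhere: the Fréchet computation depends on estimated means and covariances of an assumed Gaussian, whereas the MMD estimate is built only from pairwise kernel evaluations.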
