SelfEval: Leveraging the discriminative nature of generative models for evaluation

Abstract

In this work, we show that text-to-image generative models can be "inverted" to assess their own text-image understanding capabilities in a completely automated manner. Our method, SelfEval, uses the generative model to compute the likelihood of real images given text prompts, which makes the generative model directly applicable to discriminative tasks. Using SelfEval, we repurpose standard datasets created for evaluating multimodal text-image discriminative models to evaluate generative models in a fine-grained manner, assessing their performance on attribute binding, color recognition, counting, shape recognition, and spatial understanding. To the best of our knowledge, SelfEval is the first automated metric to show a high degree of agreement with gold-standard human evaluations when measuring text faithfulness across multiple models and benchmarks. Moreover, SelfEval enables us to evaluate generative models on challenging tasks such as the Winoground image-score, where they demonstrate performance competitive with discriminative models. We also show severe drawbacks of standard automated metrics such as CLIP-score for measuring text faithfulness on benchmarks such as DrawBench, and how SelfEval sidesteps these issues. We hope SelfEval enables easy and reliable automated evaluation for diffusion models.
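
The idea of turning image likelihoods into a discriminative decision can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the likelihood is computed, so the scorer is passed in as a hypothetical callable `log_likelihood(image, caption)`, and `toy_log_likelihood` is a made-up stand-in that represents an image as a set of tags so that the example runs without a model. The names `pick_caption` and `accuracy` are likewise assumptions introduced for illustration.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# Hypothetical scorer type: returns a (log-)likelihood of the image given the caption.
Scorer = Callable[[FrozenSet[str], str], float]


def pick_caption(log_likelihood: Scorer, image: FrozenSet[str], captions: Sequence[str]) -> int:
    """Return the index of the caption under which the image is most likely."""
    scores = [log_likelihood(image, caption) for caption in captions]
    return max(range(len(captions)), key=scores.__getitem__)


def accuracy(log_likelihood: Scorer, examples: Sequence[Tuple[FrozenSet[str], List[str], int]]) -> float:
    """Fraction of examples where the highest-likelihood caption is the labeled one."""
    correct = sum(
        pick_caption(log_likelihood, image, captions) == label
        for image, captions, label in examples
    )
    return correct / len(examples)


def toy_log_likelihood(image: FrozenSet[str], caption: str) -> float:
    """Toy stand-in (assumption, not a real model): penalize each caption
    word that does not appear among the image's tags."""
    return -float(len(set(caption.split()) - image))


# Each example: (image as a set of tags, candidate captions, index of the true caption).
examples = [
    (frozenset({"a", "red", "cube"}), ["a red cube", "a blue cube"], 0),
    (frozenset({"two", "dogs"}), ["three dogs", "two dogs"], 1),
]

print(accuracy(toy_log_likelihood, examples))
```

In a real setting the toy scorer would be replaced by a likelihood estimate from the text-to-image model under evaluation. The selection and accuracy logic stay the same, which is what allows datasets built for discriminative models to be reused.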
