SelfEval: Leveraging the discriminative nature of generative models for evaluation
November 17, 2023
Authors: Sai Saketh Rambhatla, Ishan Misra
cs.AI
Abstract
In this work, we show that text-to-image generative models can be 'inverted'
to assess their own text-image understanding capabilities in a completely
automated manner.
Our method, called SelfEval, uses the generative model to compute the
likelihood of real images given text prompts, making the generative model
directly applicable to discriminative tasks.
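As a concrete illustration, the scoring step can be approximated in the style of diffusion-classifier approaches: noise the image, ask the text-conditioned denoiser to predict the noise, and use the expected denoising error as a proxy for the likelihood of the image given the prompt. The sketch below assumes a diffusers-style epsilon-prediction UNet and scheduler; all names are illustrative, and the paper's exact likelihood estimator may differ.

```python
import torch

@torch.no_grad()
def selfeval_style_score(unet, scheduler, text_emb, latents, n_trials=32):
    """Proxy for log p(image | text) under a text-to-image diffusion model.

    Lower expected denoising error under a given prompt indicates the model
    assigns that (image, text) pair a higher likelihood. Assumes a
    diffusers-style epsilon-prediction `unet` and `scheduler`; these names
    are illustrative, not the paper's actual API.
    """
    total = 0.0
    for _ in range(n_trials):
        # Draw a random timestep and noise the image latents accordingly.
        t = torch.randint(0, scheduler.config.num_train_timesteps, (1,),
                          device=latents.device)
        noise = torch.randn_like(latents)
        noisy = scheduler.add_noise(latents, noise, t)
        # Predict the noise conditioned on the text; accumulate the error.
        pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
        total += torch.mean((pred - noise) ** 2).item()
    # Negate so that a better (lower-error) fit yields a higher score.
    return -total / n_trials
```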
Using SelfEval, we repurpose standard datasets created for evaluating
multimodal text-image discriminative models to evaluate generative models in a
fine-grained manner: assessing their performance on attribute binding, color
recognition, counting, shape recognition, and spatial understanding.
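Given such a scorer, repurposing a discriminative benchmark is mechanical: for each image, score every candidate caption and check whether the highest-scoring one matches the ground truth. A hypothetical helper, reusing `selfeval_style_score` from the sketch above:

```python
def pick_best_caption(unet, scheduler, latents, candidate_text_embs):
    # Score every candidate caption against the image latents and return the
    # index of the most likely one; benchmark accuracy is then the fraction
    # of examples where this index matches the ground-truth caption.
    scores = [selfeval_style_score(unet, scheduler, emb, latents)
              for emb in candidate_text_embs]
    return max(range(len(scores)), key=scores.__getitem__)
```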
To the best of our knowledge, SelfEval is the first automated metric for
measuring text faithfulness that shows a high degree of agreement with
gold-standard human evaluations across multiple models and benchmarks.
Moreover, SelfEval enables us to evaluate generative models on challenging
tasks, such as the Winoground image-score task, where they demonstrate
performance competitive with discriminative models.
We also show severe drawbacks of standard automated metrics such as
CLIP-score for measuring text faithfulness on benchmarks such as DrawBench,
and how SelfEval sidesteps these issues.
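For contrast, CLIP-score is the cosine similarity between CLIP's image and text embeddings, so it measures text faithfulness with a separate discriminative model rather than with the generative model under evaluation. A minimal sketch using the Hugging Face transformers API (the checkpoint name is one common choice, not necessarily the one used in the paper):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_score(image, text, name="openai/clip-vit-base-patch32"):
    # Embed the image and text with CLIP and return their cosine similarity.
    model = CLIPModel.from_pretrained(name)
    processor = CLIPProcessor.from_pretrained(name)
    inputs = processor(text=[text], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```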
We hope SelfEval enables easy and reliable automated evaluation for diffusion
models.