SelfEval: Leveraging the discriminative nature of generative models for evaluation
November 17, 2023
Authors: Sai Saketh Rambhatla, Ishan Misra
cs.AI
Abstract
In this work, we show that text-to-image generative models can be 'inverted'
to assess their own text-image understanding capabilities in a completely
automated manner.
Our method, called SelfEval, uses the generative model to compute the
likelihood of real images given text prompts, making the generative model
directly applicable to discriminative tasks.
Using SelfEval, we repurpose standard datasets created for evaluating
multimodal text-image discriminative models to evaluate generative models in a
fine-grained manner: assessing their performance on attribute binding, color
recognition, counting, shape recognition, and spatial understanding.
To the best of our knowledge, SelfEval is the first automated metric for
measuring text faithfulness that shows a high degree of agreement with
gold-standard human evaluations across multiple models and benchmarks.
Moreover, SelfEval enables us to evaluate generative models on challenging
tasks, such as the Winoground image score, where they demonstrate performance
competitive with discriminative models.
We also show that standard automated metrics such as CLIP-score have severe
drawbacks when measuring text faithfulness on benchmarks such as DrawBench,
and how SelfEval sidesteps these issues.
We hope SelfEval enables easy and reliable automated evaluation for diffusion
models.
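
To make the likelihood computation described in the abstract concrete, the
following is a minimal sketch of the general recipe for turning a text-to-image
diffusion model into a discriminative scorer. This is not the authors' released
code: the functions eps_model (noise-prediction network) and encode_text (text
encoder) are hypothetical placeholders, and the use of an unweighted denoising
MSE as a Monte Carlo proxy for the (negative) ELBO is an illustrative
assumption.

    import torch

    @torch.no_grad()
    def selfeval_scores(eps_model, encode_text, image, prompts,
                        alphas_cumprod, n_samples=16, device="cuda"):
        """Score each candidate prompt by an ELBO-style proxy for
        p(image | prompt), estimated by Monte Carlo over diffusion
        timesteps. `eps_model` and `encode_text` are hypothetical
        stand-ins for a pretrained text-to-image diffusion model.
        """
        T = alphas_cumprod.shape[0]
        x0 = image.to(device)  # clean image (or VAE latent), shape (1, C, H, W)
        scores = []
        for prompt in prompts:
            cond = encode_text(prompt)  # text conditioning for this prompt
            err = 0.0
            for _ in range(n_samples):
                t = torch.randint(0, T, (1,), device=device)
                a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
                noise = torch.randn_like(x0)
                # forward-diffuse the *real* image to timestep t
                x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
                # denoising error; its expectation over t and noise serves as
                # a simplified, unweighted negative-ELBO proxy (assumption)
                pred = eps_model(x_t, t, cond)
                err += torch.mean((pred - noise) ** 2).item()
            # lower average denoising error ~ higher likelihood of the image
            # given this prompt, so negate to obtain a score to maximize
            scores.append(-err / n_samples)
        return scores

On a discriminative benchmark such as Winoground's image score, one would pass
the candidate captions as prompts and predict the caption with the highest
score. Note the contrast with CLIP-score, which ranks captions by embedding
similarity in a separate discriminative model; here only the generative model
itself is used.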