Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?

Abstract

We present a novel task and benchmark, Commonsense-T2I, for evaluating the ability of text-to-image (T2I) generation models to produce images that fit real-life commonsense. Given two adversarial text prompts that contain an identical set of action words with minor differences, such as "a lightbulb without electricity" vs. "a lightbulb with electricity", we evaluate whether T2I models can conduct visual-commonsense reasoning, e.g. produce images that fit "the lightbulb is unlit" vs. "the lightbulb is lit", respectively. Commonsense-T2I presents an adversarial challenge, providing pairwise text prompts along with expected outputs. The dataset is carefully hand-curated by experts and annotated with fine-grained labels, such as commonsense type and likelihood of the expected outputs, to assist in analyzing model behavior. We benchmark a variety of state-of-the-art (SOTA) T2I models and, surprisingly, find that there is still a large gap between image synthesis and real-life photos: even the DALL-E 3 model achieves only 48.92% on Commonsense-T2I, and the Stable Diffusion XL model achieves only 24.92% accuracy. Our experiments show that GPT-enriched prompts cannot solve this challenge, and we include a detailed analysis of possible reasons for this deficiency. We aim for Commonsense-T2I to serve as a high-quality evaluation benchmark for T2I commonsense checking, fostering advancements in real-life image generation.
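
The pairwise setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The abstract does not state how accuracy is aggregated, so the sketch assumes a strict rule in which a pair counts as correct only when both generated images match their expected outputs. The names `PromptPair`, `pair_is_correct`, and `pairwise_accuracy`, and the label value `"physical"`, are hypothetical. Deciding whether an image matches an expected output (by a human or an automatic judge) is left outside the sketch and supplied as booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPair:
    """One benchmark sample: two near-identical prompts with different expected outputs."""
    prompt_a: str
    prompt_b: str
    expected_a: str
    expected_b: str
    commonsense_type: str  # hypothetical field name for the fine-grained label


def pair_is_correct(matches_a: bool, matches_b: bool) -> bool:
    # ASSUMPTION: strict scoring, both images must match their expected outputs.
    return matches_a and matches_b


def pairwise_accuracy(judgements: list[tuple[bool, bool]]) -> float:
    """Return the percentage of pairs judged correct under the strict rule."""
    if not judgements:
        raise ValueError("need at least one judged pair")
    correct = sum(pair_is_correct(a, b) for a, b in judgements)
    return 100.0 * correct / len(judgements)


# The example pair quoted in the abstract; the label value is illustrative only.
example = PromptPair(
    prompt_a="a lightbulb without electricity",
    prompt_b="a lightbulb with electricity",
    expected_a="the lightbulb is unlit",
    expected_b="the lightbulb is lit",
    commonsense_type="physical",
)

# Four judged pairs: two fully correct, one half correct, one fully wrong.
judged = [(True, True), (True, False), (False, False), (True, True)]
print(f"{pairwise_accuracy(judged):.2f}%")
```

Under this assumed rule, a model that renders only one image of a pair correctly earns no credit for that pair, which is one way a paired adversarial benchmark can penalize a model that ignores the small difference between the two prompts.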

