Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?

June 11, 2024
Authors: Xingyu Fu, Muyu He, Yujie Lu, William Yang Wang, Dan Roth
cs.AI

Abstract

We present a novel task and benchmark for evaluating the ability of text-to-image (T2I) generation models to produce images that fit real-life commonsense, which we call Commonsense-T2I. Given two adversarial text prompts containing an identical set of action words with minor differences, such as "a lightbulb without electricity" vs. "a lightbulb with electricity", we evaluate whether T2I models can conduct visual-commonsense reasoning, e.g., produce images that fit "the lightbulb is unlit" vs. "the lightbulb is lit", respectively. Commonsense-T2I presents an adversarial challenge, providing pairwise text prompts along with expected outputs. The dataset is carefully hand-curated by experts and annotated with fine-grained labels, such as commonsense type and likelihood of the expected outputs, to assist in analyzing model behavior. We benchmark a variety of state-of-the-art (SOTA) T2I models and, surprisingly, find that there is still a large gap between image synthesis and real-life photos: even the DALL-E 3 model achieves only 48.92% on Commonsense-T2I, and the Stable Diffusion XL model achieves only 24.92% accuracy. Our experiments show that GPT-enriched prompts cannot solve this challenge, and we include a detailed analysis of possible reasons for this deficiency. We aim for Commonsense-T2I to serve as a high-quality evaluation benchmark for T2I commonsense checking, fostering advancements in real-life image generation.
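
The abstract does not spell out the scoring rule, so the following is only a minimal sketch of how a pairwise benchmark of this kind could be scored. It assumes that a sample counts as correct only when both images in the pair match their expected outputs. The names `PromptPair`, `pairwise_accuracy`, `generate`, and `judge` are hypothetical and are not taken from the paper or any released code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class PromptPair:
    """Hypothetical benchmark sample: two near-identical prompts, each with
    a description of what the generated image is expected to show."""
    prompt_a: str
    expected_a: str
    prompt_b: str
    expected_b: str


def pairwise_accuracy(
    pairs: Iterable[PromptPair],
    generate: Callable[[str], object],      # hypothetical T2I model: prompt -> image
    judge: Callable[[object, str], bool],   # hypothetical checker: (image, expected) -> bool
) -> float:
    """Fraction of pairs for which BOTH generated images match expectations."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one prompt pair is required")

    correct = 0
    for pair in pairs:
        ok_a = judge(generate(pair.prompt_a), pair.expected_a)
        ok_b = judge(generate(pair.prompt_b), pair.expected_b)
        # Assumption: a pair scores only if both sides are right, so a model
        # that always draws a lit bulb gets no credit for the "lit" prompt.
        if ok_a and ok_b:
            correct += 1
    return correct / len(pairs)
```

Under this assumed rule, a model that ignores the difference between the two prompts scores zero on that pair even if one of its two images happens to be right, which is what makes the paired setup adversarial.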

