Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?

Abstract

We present a novel task and benchmark, Commonsense-T2I, for evaluating the ability of text-to-image (T2I) generation models to produce images that fit real-life commonsense. Given two adversarial text prompts that contain an identical set of action words with minor differences, such as "a lightbulb without electricity" vs. "a lightbulb with electricity", we evaluate whether T2I models can conduct visual-commonsense reasoning, e.g. produce images that fit "the lightbulb is unlit" vs. "the lightbulb is lit", respectively. Commonsense-T2I presents an adversarial challenge, providing pairwise text prompts along with expected outputs. The dataset is carefully hand-curated by experts and annotated with fine-grained labels, such as commonsense type and likelihood of the expected outputs, to assist in analyzing model behavior. We benchmark a variety of state-of-the-art (SOTA) T2I models and, surprisingly, find that there is still a large gap between image synthesis and real-life photos: even the DALL-E 3 model achieves only 48.92% on Commonsense-T2I, and the Stable Diffusion XL model achieves only 24.92% accuracy. Our experiments show that GPT-enriched prompts cannot solve this challenge, and we include a detailed analysis of possible reasons for this deficiency. We aim for Commonsense-T2I to serve as a high-quality evaluation benchmark for T2I commonsense checking, fostering advancements in real-life image generation.
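
The pairwise setup described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the benchmark's actual code: the `PromptPair` field names, the label value "physical", the representation of likelihood as a float, and in particular the scoring rule, which here counts a pair as correct only when both generated images match their expected outputs. The generator and judge are toy stand-ins that use strings in place of images.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PromptPair:
    """Hypothetical record for one benchmark entry (field names are assumed)."""
    prompt_1: str
    prompt_2: str
    expected_1: str          # expected output description for prompt_1
    expected_2: str          # expected output description for prompt_2
    commonsense_type: str    # fine-grained label; the value used below is made up
    likelihood: float        # likelihood of the expected outputs; float is assumed


# Toy interfaces: an "image" is just a string here.
Generator = Callable[[str], str]       # prompt -> image
Judge = Callable[[str, str], bool]     # (image, expected description) -> match?


def pairwise_accuracy(pairs: Sequence[PromptPair],
                      generate: Generator,
                      judge: Judge) -> float:
    """Fraction of pairs for which BOTH images match their expected outputs.

    The both-must-match rule is an assumption of this sketch.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = 0
    for pair in pairs:
        ok_1 = judge(generate(pair.prompt_1), pair.expected_1)
        ok_2 = judge(generate(pair.prompt_2), pair.expected_2)
        if ok_1 and ok_2:
            correct += 1
    return correct / len(pairs)


# The lightbulb example from the abstract.
example = PromptPair(
    prompt_1="a lightbulb without electricity",
    prompt_2="a lightbulb with electricity",
    expected_1="the lightbulb is unlit",
    expected_2="the lightbulb is lit",
    commonsense_type="physical",  # hypothetical label
    likelihood=1.0,               # hypothetical value
)


def always_lit(prompt: str) -> str:
    """Toy generator that ignores the prompt's differences."""
    return "the lightbulb is lit"


def oracle(prompt: str) -> str:
    """Toy generator that always produces the expected output."""
    lookup = {example.prompt_1: example.expected_1,
              example.prompt_2: example.expected_2}
    return lookup[prompt]


def exact_match(image: str, description: str) -> bool:
    """Toy judge: string equality stands in for an image evaluator."""
    return image == description


print(pairwise_accuracy([example], always_lit, exact_match))
print(pairwise_accuracy([example], oracle, exact_match))
```

Under this assumed rule, a model that renders a lit bulb for both prompts gets no credit for the pair, which is the kind of failure an adversarial pairwise design is meant to expose.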
