Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

Abstract

Text-to-image (T2I) generation aims to synthesize images from textual prompts, which both specify what must be shown and imply what can be inferred. These two roles correspond to two core capabilities: composition and reasoning. However, as T2I models advance in reasoning beyond composition, existing benchmarks show clear limitations in evaluating these capabilities comprehensively, both across and within them. These advances also enable models to handle more complex prompts, yet current benchmarks remain limited to low scene density and simplified one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both the composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), yielding a 12-dimensional evaluation taxonomy. To increase complexity, motivated by the inherent complexity of real-world scenarios, we curate each prompt with high compositional density for composition and multi-step inference for reasoning. We also pair each prompt with a checklist of individual yes/no questions, each assessing one intended element independently, to enable fine-grained and reliable evaluation. In total, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 27 current T2I models reveal that composition capability remains limited in complex, high-density scenarios, while reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts. Project page: https://t2i-corebench.github.io/.
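
As a rough illustration of how per-prompt yes/no checklists could be turned into a score, here is a minimal sketch. The abstract does not say how answers are aggregated, so both the per-prompt fraction of "yes" answers and the unweighted mean over prompts are assumptions. The names `ChecklistItem`, `prompt_score`, and `benchmark_score` are hypothetical and are not taken from the benchmark's code.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    """One yes/no question about a single intended element (hypothetical type)."""
    question: str
    answer: bool  # True if a judge answered "yes" for the generated image


def prompt_score(items: list[ChecklistItem]) -> float:
    """Fraction of checklist questions answered "yes" for one image.

    Assumption: every question carries equal weight.
    """
    if not items:
        raise ValueError("checklist must contain at least one question")
    return sum(item.answer for item in items) / len(items)


def benchmark_score(checklists: list[list[ChecklistItem]]) -> float:
    """Unweighted mean of per-prompt scores (an assumed aggregation)."""
    if not checklists:
        raise ValueError("need at least one prompt checklist")
    return sum(prompt_score(items) for items in checklists) / len(checklists)


# Toy data for illustration only; not drawn from the benchmark.
dense_scene = [
    ChecklistItem("Is there a red umbrella?", True),
    ChecklistItem("Is the umbrella to the left of the bench?", True),
    ChecklistItem("Are there exactly three pigeons?", False),
    ChecklistItem("Is the bench made of wood?", True),
]
implied_outcome = [
    ChecklistItem("Is the glass shown broken?", True),
    ChecklistItem("Is water spilled on the floor?", False),
]

per_prompt = prompt_score(dense_scene)
overall = benchmark_score([dense_scene, implied_outcome])
```

Scoring each question independently keeps one missed element from hiding the elements an image did get right, which is the fine-grained behaviour the checklist design aims for.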
