Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?
September 3, 2025
Authors: Ouxiang Li, Yuan Wang, Xinting Hu, Huijuan Huang, Rui Chen, Jiarong Ou, Xin Tao, Pengfei Wan, Fuli Feng
cs.AI
Abstract
Text-to-image (T2I) generation aims to synthesize images from textual
prompts, which jointly specify what must be shown and imply what can be
inferred, thereby corresponding to two core capabilities: composition and
reasoning. However, as T2I models advance in reasoning beyond composition,
existing benchmarks reveal clear limitations in providing comprehensive
evaluation across and within these capabilities. Meanwhile,
these advances also enable models to handle more complex prompts, whereas
current benchmarks remain limited to low scene density and simplified
one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a
comprehensive and complex benchmark that evaluates both composition and
reasoning capabilities of T2I models. To ensure comprehensiveness, we structure
composition around scene graph elements (instance, attribute, and relation) and
reasoning around the philosophical framework of inference (deductive,
inductive, and abductive), formulating a 12-dimensional evaluation taxonomy. To
increase complexity, motivated by the inherent complexity of real-world
scenarios, we curate each prompt with high compositional density (for
composition) and multi-step inference (for reasoning). We also pair each prompt
with a checklist of individual yes/no questions, each assessing one intended
element independently, to enable fine-grained and reliable evaluation. In
total, our benchmark comprises 1,080 challenging prompts
and around 13,500 checklist questions. Experiments across 27 current T2I models
reveal that their composition capability remains limited in complex
high-density scenarios, while their reasoning capability lags even further behind
as a critical bottleneck, with all models struggling to infer implicit elements
from prompts. Our project page: https://t2i-corebench.github.io/.
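
To make the checklist protocol concrete, below is a minimal Python sketch of
how per-prompt scoring could work under this design: a generated image is
scored by the fraction of its yes/no checklist questions that a judge answers
affirmatively. The `ask_judge` callable, the example prompt, and the example
questions are hypothetical stand-ins for illustration, not the benchmark's
actual judge model or data.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PromptCase:
    prompt: str           # the T2I prompt given to the model
    checklist: List[str]  # independent yes/no questions, one per intended element


def score_image(image_path: str, case: PromptCase,
                ask_judge: Callable[[str, str], bool]) -> float:
    """Return the fraction of checklist questions the judge answers 'yes'."""
    answers = [ask_judge(image_path, q) for q in case.checklist]
    return sum(answers) / len(case.checklist)


if __name__ == "__main__":
    case = PromptCase(
        prompt="A red umbrella leaning against a wooden bench in the rain",
        checklist=[
            "Is there an umbrella in the image?",
            "Is the umbrella red?",
            "Is there a wooden bench?",
            "Is the umbrella leaning against the bench?",
            "Is it raining in the scene?",
        ],
    )
    # Toy judge that always answers 'yes'; a real setup would call a VQA/VLM judge.
    toy_judge = lambda image, question: True
    print(score_image("generated.png", case, toy_judge))  # -> 1.0
```

Because each question targets one intended element, this per-question scoring
localizes failures (e.g., a correct bench but a missing umbrella) rather than
collapsing the whole prompt into a single pass/fail judgment.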