Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

September 3, 2025
Authors: Ouxiang Li, Yuan Wang, Xinting Hu, Huijuan Huang, Rui Chen, Jiarong Ou, Xin Tao, Pengfei Wan, Fuli Feng
cs.AI

Abstract

Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred, thereby corresponding to two core capabilities: composition and reasoning. However, as T2I models advance in reasoning beyond composition, existing benchmarks show clear limitations in evaluating these capabilities comprehensively, both across and within them. Meanwhile, these advances also enable models to handle more complex prompts, whereas current benchmarks remain limited to low scene density and simplified one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both the composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene-graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), yielding a 12-dimensional evaluation taxonomy. To increase complexity, motivated by the inherent complexity of real-world scenarios, we curate each prompt to have high compositional density for composition and to require multi-step inference for reasoning. We also pair each prompt with a checklist of yes/no questions, each assessing one intended element independently, to enable fine-grained and reliable evaluation. In total, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 27 current T2I models reveal that their composition capability remains limited in complex, high-density scenarios, while reasoning lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts. Project page: https://t2i-corebench.github.io/.
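
The checklist-based protocol described in the abstract can be illustrated with a minimal sketch. The data structures and names below (`ChecklistItem`, `BenchmarkPrompt`, `score_image`, `judge`) are hypothetical, not the authors' released code; the sketch assumes a pluggable yes/no judge (e.g., a VLM-based evaluator) that answers each checklist question independently, with the per-prompt score taken as the fraction of "yes" answers. The roughly dozen questions per prompt follows from the abstract's statistics (≈13,500 questions / 1,080 prompts).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChecklistItem:
    """One independent yes/no question probing a single intended element."""
    question: str  # e.g., "Is the umbrella red?"


@dataclass
class BenchmarkPrompt:
    """A prompt plus its checklist (roughly a dozen questions per prompt)."""
    prompt: str
    dimension: str  # one of the 12 taxonomy dimensions, e.g. "attribute" or "deductive"
    checklist: List[ChecklistItem]


def score_image(image_path: str,
                item: BenchmarkPrompt,
                judge: Callable[[str, str], bool]) -> float:
    """Return the fraction of checklist questions the judge answers 'yes'.

    `judge(image_path, question) -> bool` is a stand-in for any yes/no
    evaluator; each question is asked independently, so the score is
    fine-grained at the level of individual intended elements.
    """
    answers = [judge(image_path, q.question) for q in item.checklist]
    return sum(answers) / len(answers)


# Usage with a trivial stand-in judge that always answers "no":
example = BenchmarkPrompt(
    prompt="A red umbrella leaning against a wooden bench in the rain",
    dimension="attribute",
    checklist=[
        ChecklistItem("Is there an umbrella in the image?"),
        ChecklistItem("Is the umbrella red?"),
        ChecklistItem("Is there a wooden bench?"),
        ChecklistItem("Is it raining?"),
    ],
)
print(score_image("generated.png", example, judge=lambda img, q: False))  # -> 0.0
```

Scoring each element with its own question, rather than one holistic rating, is what makes the evaluation both fine-grained (failures are localized to specific elements) and reliable (each judgment is a simple binary decision).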