Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

Abstract

Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred. These two roles correspond to two core capabilities: composition and reasoning. As T2I models begin to advance in reasoning beyond composition, existing benchmarks show clear limitations in evaluating these capabilities comprehensively, both across and within them. These advances also let models handle more complex prompts, yet current benchmarks remain limited to low scene density and simplified one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both the composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), formulating a 12-dimensional evaluation taxonomy. To increase complexity, motivated by the inherent complexity of real-world scenarios, we curate each prompt with high compositional density for composition and multi-step inference for reasoning. To facilitate fine-grained and reliable evaluation, we pair each prompt with a checklist of individual yes/no questions, each assessing one intended element independently. In total, the benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 27 current T2I models reveal that composition capability remains limited in complex high-density scenarios, while reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts. Our project page: https://t2i-corebench.github.io/.
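The checklist-based evaluation described above can be illustrated with a minimal sketch. The abstract does not say how the yes/no answers are aggregated, so the averaging used here is an assumption, and the `ChecklistItem` type, the `checklist_score` helper, and the example questions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    question: str  # yes/no question about one intended element
    answer: bool   # True if a judge answered "yes" for the generated image


def checklist_score(items: list[ChecklistItem]) -> float:
    """Fraction of checklist questions answered "yes" for one image.

    Assumption: a plain average; the benchmark's real aggregation may differ.
    """
    if not items:
        raise ValueError("checklist must contain at least one question")
    return sum(item.answer for item in items) / len(items)


# Hypothetical checklist for a single prompt; questions and answers are made up.
example = [
    ChecklistItem("Is there a red umbrella?", True),
    ChecklistItem("Is the umbrella to the left of the bench?", False),
    ChecklistItem("Is the ground wet?", True),
    ChecklistItem("Are there exactly three pigeons?", True),
]
score = checklist_score(example)
print(f"checklist score: {score:.2f}")
```

Because each question targets one element, a failed relation or count lowers the score without hiding the elements the model did render correctly.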
