
Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

January 28, 2026
Authors: Zengbin Wang, Xuecai Hu, Yong Wang, Feng Xiong, Man Zhang, Xiangxiang Chu
cs.AI

Abstract

Text-to-image (T2I) models have achieved remarkable success in generating high-fidelity images, but they often fail to handle complex spatial relationships, e.g., spatial perception, reasoning, or interaction. Current benchmarks largely overlook these critical aspects because their prompts are short or information-sparse. In this paper, we introduce SpatialGenEval, a new benchmark designed to systematically evaluate the spatial intelligence of T2I models. Our work covers two key aspects: (1) SpatialGenEval comprises 1,230 long, information-dense prompts across 25 real-world scenes. Each prompt integrates 10 spatial sub-domains and 10 corresponding multiple-choice question-answer pairs, ranging from object position and layout to occlusion and causality. Our extensive evaluation of 21 state-of-the-art models reveals that higher-order spatial reasoning remains a primary bottleneck. (2) To demonstrate that the utility of our information-dense design goes beyond evaluation, we also construct the SpatialT2I dataset. It contains 15,400 text-image pairs with rewritten prompts to ensure image consistency while preserving information density. Fine-tuning current foundation models (i.e., Stable Diffusion-XL, Uniworld-V1, OmniGen2) yields consistent performance gains (+4.2%, +5.7%, +4.4%) and more realistic spatial relations, highlighting a data-centric paradigm for achieving spatial intelligence in T2I models.
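As a rough illustration of how a benchmark of this shape could be scored: with 1,230 prompts and 10 multiple-choice questions per prompt, there are 12,300 questions in total, and accuracy can be reported per spatial sub-domain as well as overall. The sketch below is a minimal example under assumptions of our own. The record layout, the function name `subdomain_accuracy`, and the sub-domain labels are hypothetical and are not taken from the SpatialGenEval release, whose actual evaluation protocol is not described in this abstract.

```python
from collections import defaultdict


def subdomain_accuracy(records):
    """Aggregate multiple-choice results per sub-domain and overall.

    records: iterable of (sub_domain, predicted_choice, correct_choice)
    tuples. This layout is an assumption made for illustration only.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sub_domain, predicted, correct in records:
        totals[sub_domain] += 1
        if predicted == correct:
            hits[sub_domain] += 1
    # Fraction of correctly answered questions in each sub-domain.
    per_domain = {name: hits[name] / totals[name] for name in totals}
    # Fraction of correctly answered questions across all sub-domains.
    overall = sum(hits.values()) / sum(totals.values()) if totals else 0.0
    return per_domain, overall


# Hypothetical toy records; the labels are placeholders, not the benchmark's.
records = [
    ("position", "A", "A"),
    ("position", "B", "A"),
    ("occlusion", "C", "C"),
    ("causality", "D", "B"),
]
per_domain, overall = subdomain_accuracy(records)
print(per_domain, overall)
```

Reporting per-sub-domain accuracy alongside the overall figure is what would make a finding such as "higher-order spatial reasoning remains a primary bottleneck" visible, since an overall average alone would hide which kinds of question a model fails.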