
ComposeAnything: Composite Object Priors for Text-to-Image Generation

May 30, 2025
Authors: Zeeshan Khan, Shizhe Chen, Cordelia Schmid
cs.AI

Abstract

Generating images from text involving complex and novel object arrangements remains a significant challenge for current text-to-image (T2I) models. Although prior layout-based methods improve object arrangement using spatial constraints from 2D layouts, they often struggle to capture 3D positioning and sacrifice quality and coherence. In this work, we introduce ComposeAnything, a novel framework that improves compositional image generation without retraining existing T2I models. Our approach first leverages the chain-of-thought reasoning abilities of large language models (LLMs) to produce 2.5D semantic layouts from text, consisting of 2D object bounding boxes enriched with depth information and detailed captions. Based on this layout, we generate a spatial- and depth-aware coarse composite of objects that captures the intended composition, serving as a strong and interpretable prior that replaces stochastic noise initialization in diffusion-based T2I models. This prior guides the denoising process through object prior reinforcement and spatial-controlled denoising, enabling seamless generation of compositional objects and coherent backgrounds while allowing refinement of inaccurate priors. ComposeAnything outperforms state-of-the-art methods on the T2I-CompBench and NSR-1K benchmarks for prompts with 2D/3D spatial arrangements, high object counts, and surreal compositions. Human evaluations further confirm that our model generates high-quality images whose compositions faithfully reflect the text.
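
To make the abstract's "2.5D semantic layout" concrete, here is a minimal, hypothetical Python sketch of such a layout (2D bounding boxes enriched with depth and per-object captions) and of a depth-ordered coarse composite. The names `Object25D` and `compose_depth_ordered`, the normalized-bbox convention, and the larger-depth-means-closer ordering are illustrative assumptions, not the paper's actual data structures or code.

```python
# Hypothetical sketch of a 2.5D semantic layout: 2D boxes + depth + captions.
# Not the paper's API; names and conventions here are assumptions.
from dataclasses import dataclass

import numpy as np


@dataclass
class Object25D:
    caption: str  # detailed per-object caption, produced by the LLM
    bbox: tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)
    depth: float  # relative depth; by assumption, larger = closer to camera


def compose_depth_ordered(objects: list[Object25D], size: int = 64) -> np.ndarray:
    """Paint object regions back-to-front so nearer objects occlude farther
    ones, yielding a coarse, depth-aware composite (here a simple index map)."""
    canvas = np.zeros((size, size), dtype=np.int32)  # 0 = background
    for idx, obj in enumerate(sorted(objects, key=lambda o: o.depth), start=1):
        x0, y0, x1, y1 = (int(round(c * size)) for c in obj.bbox)
        canvas[y0:y1, x0:x1] = idx  # closer objects overwrite farther ones
    return canvas


layout = [
    Object25D("a glass vase behind the apple", (0.3, 0.3, 0.7, 0.8), depth=0.2),
    Object25D("a red apple on a table", (0.1, 0.5, 0.4, 0.9), depth=0.8),
]
composite = compose_depth_ordered(layout)
```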
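The abstract also describes replacing stochastic noise initialization with this composite prior. The following is a rough sketch of that initialization idea in a latent-diffusion setting; the blending scheme, the `strength` parameter, and the function name are guesses for illustration, not the paper's actual "object prior reinforcement" procedure.

```python
# Hypothetical sketch: initialize denoising from a structured prior rather
# than pure Gaussian noise. The linear blend and its strength are assumptions.
import torch


def init_from_prior(prior_latent: torch.Tensor, strength: float = 0.7) -> torch.Tensor:
    """Mix the encoded coarse composite with Gaussian noise so denoising
    starts from an interpretable, composition-aware state."""
    noise = torch.randn_like(prior_latent)
    return strength * prior_latent + (1.0 - strength) * noise
```

Because such a prior only biases the starting point rather than hard-constraining the output, later denoising steps can still correct imperfect boxes, which is consistent with the abstract's claim that inaccurate priors can be refined during generation.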