ComposeAnything: Composite Object Priors for Text-to-Image Generation
May 30, 2025
作者: Zeeshan Khan, Shizhe Chen, Cordelia Schmid
cs.AI
Abstract
Generating images from text involving complex and novel object arrangements
remains a significant challenge for current text-to-image (T2I) models.
Although prior layout-based methods improve object arrangements using spatial
constraints with 2D layouts, they often struggle to capture 3D positioning and
sacrifice quality and coherence. In this work, we introduce ComposeAnything, a
novel framework for improving compositional image generation without retraining
existing T2I models. Our approach first leverages the chain-of-thought
reasoning abilities of LLMs to produce 2.5D semantic layouts from text,
consisting of 2D object bounding boxes enriched with depth information and
detailed captions. Based on this layout, we generate a spatial and depth aware
coarse composite of objects that captures the intended composition, serving as
a strong and interpretable prior that replaces stochastic noise initialization
in diffusion-based T2I models. This prior guides the denoising process through
object prior reinforcement and spatial-controlled denoising, enabling seamless
generation of compositional objects and coherent backgrounds, while allowing
refinement of inaccurate priors. ComposeAnything outperforms state-of-the-art
methods on the T2I-CompBench and NSR-1K benchmarks for prompts with 2D/3D
spatial arrangements, high object counts, and surreal compositions. Human
evaluations further demonstrate that our model generates high-quality images
with compositions that faithfully reflect the text.
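The pipeline described above can be illustrated with a minimal sketch (not the authors' code): a hypothetical 2.5D semantic layout, with 2D bounding boxes, depth values, and per-object captions, is composited back-to-front into a coarse image that is blended with noise to replace the stochastic initialization of a diffusion model. The layout entries, box format, and `noise_strength` parameter are illustrative assumptions.

```python
# Sketch of building a coarse composite prior from a hypothetical 2.5D layout.
# Boxes are (x0, y0, x1, y1); smaller depth means closer to the camera.
import numpy as np

H = W = 64  # toy canvas resolution

layout = [
    {"caption": "a red ball", "box": (8, 40, 28, 60), "depth": 1.0},
    {"caption": "a blue cube", "box": (30, 20, 55, 45), "depth": 2.0},
]

def coarse_composite(layout, h=H, w=W, noise_strength=0.5, seed=0):
    """Paste per-object patches back-to-front (depth-aware occlusion),
    then blend with Gaussian noise so the result acts as a soft prior
    rather than a hard constraint on the denoising process."""
    rng = np.random.default_rng(seed)
    canvas = np.zeros((h, w, 3), dtype=np.float32)
    # Draw the farthest objects first so nearer ones occlude them.
    for obj in sorted(layout, key=lambda o: -o["depth"]):
        x0, y0, x1, y1 = obj["box"]
        # Stand-in random patch; in the actual method this would be an
        # object image generated from obj["caption"].
        patch = rng.uniform(size=(y1 - y0, x1 - x0, 3)).astype(np.float32)
        canvas[y0:y1, x0:x1] = patch
    noise = rng.standard_normal((h, w, 3)).astype(np.float32)
    return (1 - noise_strength) * canvas + noise_strength * noise

prior = coarse_composite(layout)
print(prior.shape)  # (64, 64, 3)
```

In the paper's framework this prior then steers denoising via object prior reinforcement and spatial-controlled denoising, so inaccurate regions of the composite can still be refined by the T2I model.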