ComposeAnything: Composite Object Priors for Text-to-Image Generation

Abstract

Generating images from text that involves complex and novel object arrangements remains a significant challenge for current text-to-image (T2I) models. Although prior layout-based methods improve object arrangements by imposing spatial constraints with 2D layouts, they often struggle to capture 3D positioning and sacrifice quality and coherence. In this work, we introduce ComposeAnything, a novel framework for improving compositional image generation without retraining existing T2I models. Our approach first leverages the chain-of-thought reasoning abilities of large language models (LLMs) to produce 2.5D semantic layouts from text, consisting of 2D object bounding boxes enriched with depth information and detailed captions. Based on this layout, we generate a spatially aware and depth-aware coarse composite of objects that captures the intended composition and serves as a strong, interpretable prior that replaces stochastic noise initialization in diffusion-based T2I models. This prior guides the denoising process through object prior reinforcement and spatial-controlled denoising, enabling seamless generation of compositional objects and coherent backgrounds while allowing inaccurate priors to be refined. ComposeAnything outperforms state-of-the-art methods on the T2I-CompBench and NSR-1K benchmarks for prompts with 2D/3D spatial arrangements, high object counts, and surreal compositions. Human evaluations further demonstrate that our model generates high-quality images with compositions that faithfully reflect the text.
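To make the idea of a 2.5D semantic layout and a depth-aware coarse composite more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a layout is a list of objects with a caption, a 2D bounding box, and a scalar depth, and it paints box regions back to front so that nearer objects occlude farther ones. The names `LayoutObject` and `composite_labels`, the pixel-coordinate box format, and the convention that a larger depth means farther from the camera are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass
class LayoutObject:
    caption: str  # detailed per-object caption
    box: tuple    # (x0, y0, x1, y1) in pixels, x1 and y1 exclusive (assumed format)
    depth: float  # assumed convention: larger value = farther from the camera


def composite_labels(objects, width, height):
    """Paint boxes back to front so nearer objects overwrite farther ones.

    Returns a height x width grid of object indices, with -1 for background.
    """
    canvas = [[-1] * width for _ in range(height)]
    # Farthest objects first, so that nearer ones are drawn on top.
    order = sorted(range(len(objects)), key=lambda i: objects[i].depth, reverse=True)
    for i in order:
        x0, y0, x1, y1 = objects[i].box
        # Clip each box to the canvas before filling it.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                canvas[y][x] = i
    return canvas


# Illustrative layout: the ball is nearer than the bench and overlaps it.
layout = [
    LayoutObject("a red ball on the grass", (2, 2, 6, 6), depth=1.0),
    LayoutObject("a wooden bench behind the ball", (4, 1, 8, 5), depth=3.0),
]
canvas = composite_labels(layout, width=8, height=8)
```

In this toy layout the overlapping region is assigned to the ball because it has the smaller depth. In a pipeline like the one described above, a composite of this kind would presumably hold coarse object imagery rather than index labels before being used in place of random noise initialization, but that step is outside the scope of this sketch.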
