ComposeAnything: Composite Object Priors for Text-to-Image Generation

Abstract

Generating images from text that involves complex and novel object arrangements remains a significant challenge for current text-to-image (T2I) models. Although prior layout-based methods improve object arrangements by applying spatial constraints with 2D layouts, they often struggle to capture 3D positioning and sacrifice quality and coherence. In this work, we introduce ComposeAnything, a novel framework for improving compositional image generation without retraining existing T2I models. Our approach first leverages the chain-of-thought reasoning abilities of large language models (LLMs) to produce 2.5D semantic layouts from text, consisting of 2D object bounding boxes enriched with depth information and detailed captions. Based on this layout, we generate a spatial- and depth-aware coarse composite of objects that captures the intended composition and serves as a strong, interpretable prior that replaces stochastic noise initialization in diffusion-based T2I models. This prior guides the denoising process through object prior reinforcement and spatial-controlled denoising, enabling seamless generation of compositional objects and coherent backgrounds while still allowing inaccurate priors to be refined. ComposeAnything outperforms state-of-the-art methods on the T2I-CompBench and NSR-1K benchmarks for prompts with 2D/3D spatial arrangements, high object counts, and surreal compositions. Human evaluations further demonstrate that our model generates high-quality images with compositions that faithfully reflect the text.
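
As a rough illustration of the 2.5D layout idea described above (2D bounding boxes enriched with depth and captions, combined into a depth-aware coarse composite), the following minimal sketch may help. It is not the authors' implementation: the names `LayoutObject` and `coarse_composite` are hypothetical, the convention that a larger `depth` value means farther from the camera is an assumption, and single-character labels stand in for the object images that the actual method would composite before using the result as a diffusion prior.

```python
from dataclasses import dataclass


@dataclass
class LayoutObject:
    """One entry of a hypothetical 2.5D layout (illustrative only)."""
    name: str
    caption: str
    box: tuple[int, int, int, int]  # (x0, y0, x1, y1); x1 and y1 are exclusive
    depth: float  # assumption: larger value = farther from the camera


def coarse_composite(objects, width, height, background="."):
    """Paint objects onto a grid so that nearer objects occlude farther ones."""
    canvas = [[background] * width for _ in range(height)]
    # Painter's algorithm: draw the farthest object first, the nearest last.
    for obj in sorted(objects, key=lambda o: o.depth, reverse=True):
        x0, y0, x1, y1 = obj.box
        # Clip the box to the canvas so out-of-range boxes do not raise errors.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                canvas[y][x] = obj.name[0]
    return canvas


layout = [
    LayoutObject("cat", "a grey cat sitting upright", (1, 1, 5, 4), depth=1.0),
    LayoutObject("sofa", "a red fabric sofa", (0, 2, 8, 5), depth=2.0),
]
canvas = coarse_composite(layout, width=8, height=5)
print("\n".join("".join(row) for row in canvas))
```

Because the objects are sorted by depth before painting, the result does not depend on the order in which the layout lists them; the cat (nearer) always overwrites the part of the sofa it overlaps.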
