Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation

Abstract

Despite significant advances in text-to-image models for generating high-quality images, these methods still struggle to keep images controllable by the text prompt when the prompt is complex, especially when it comes to retaining object attributes and relationships. In this paper, we propose CompAgent, a training-free approach for compositional text-to-image generation with a large language model (LLM) agent at its core. CompAgent is built on a divide-and-conquer methodology. Given a complex text prompt containing multiple concepts, including objects, attributes, and relationships, the LLM agent first decomposes it: it extracts the individual objects and their associated attributes, and predicts a coherent scene layout. These individual objects can then be conquered independently. Next, the agent reasons over the text, then plans and employs tools to compose the isolated objects. Finally, a verification and human feedback mechanism is incorporated into the agent to correct potential attribute errors and refine the generated images. Guided by the LLM agent, we propose a tuning-free multi-concept customization model and a layout-to-image generation model as the tools for concept composition, and a local image editing method as the tool that interacts with the agent for verification. The scene layout controls the image generation process across these tools to prevent confusion among multiple objects. Extensive experiments demonstrate the superiority of our approach for compositional text-to-image generation: CompAgent achieves more than a 10% improvement on T2I-CompBench, a comprehensive benchmark for open-world compositional T2I generation. Extensions to various related tasks also illustrate the flexibility of CompAgent for potential applications.
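
The decompose, compose, and verify loop summarized in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`ObjectSpec`, `decompose`, `compose`, `verify`, `local_edit`, `generate`) is hypothetical, the string parsing stands in for the LLM agent, and the "image" is just a dictionary of rendered attributes so that the control flow can run without any model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectSpec:
    """One object from the prompt, its attributes, and a layout box (assumed format)."""
    name: str
    attributes: List[str]
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized


@dataclass
class Image:
    """Toy stand-in for an image: maps object name to rendered attributes."""
    content: Dict[str, List[str]] = field(default_factory=dict)


def decompose(prompt: str) -> List[ObjectSpec]:
    """Hypothetical stand-in for the agent's decomposition step.

    Handles prompts shaped like "a red apple and a green pear" and assigns
    side-by-side boxes; a real agent would query an LLM instead.
    """
    phrases = [p.strip() for p in prompt.split(" and ")]
    specs = []
    for i, phrase in enumerate(phrases):
        words = [w for w in phrase.split() if w not in ("a", "an", "the")]
        box = (i / len(phrases), 0.0, (i + 1) / len(phrases), 1.0)
        specs.append(ObjectSpec(words[-1], words[:-1], box))
    return specs


def compose(specs: List[ObjectSpec]) -> Image:
    """Toy composition tool.

    It deliberately drops the last object's attributes to simulate an
    attribute error that verification must catch.
    """
    image = Image({s.name: list(s.attributes) for s in specs})
    if specs:
        image.content[specs[-1].name] = []
    return image


def verify(image: Image, specs: List[ObjectSpec]) -> List[ObjectSpec]:
    """Return the objects whose rendered attributes differ from the plan."""
    return [s for s in specs if image.content.get(s.name) != s.attributes]


def local_edit(image: Image, spec: ObjectSpec) -> None:
    """Toy local edit: rewrite only the entry for one object."""
    image.content[spec.name] = list(spec.attributes)


def generate(prompt: str, max_rounds: int = 3):
    """Divide (decompose), conquer (compose), then verify and correct."""
    specs = decompose(prompt)
    image = compose(specs)
    for _ in range(max_rounds):
        wrong = verify(image, specs)
        if not wrong:
            break
        for spec in wrong:
            local_edit(image, spec)
    return image, specs
```

Calling `generate("a red apple and a green pear")` runs one correction round in this toy setup, because the simulated composition step drops the second object's attribute and the verification step detects and repairs it.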
