
Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation

January 28, 2024
作者: Zhenyu Wang, Enze Xie, Aoxue Li, Zhongdao Wang, Xihui Liu, Zhenguo Li
cs.AI

Abstract

Despite significant advancements in text-to-image models for generating high-quality images, these methods still struggle to ensure controllability over the generated image given a complex text prompt, especially when it comes to preserving object attributes and relationships. In this paper, we propose CompAgent, a training-free approach for compositional text-to-image generation, with a large language model (LLM) agent as its core. The fundamental idea underlying CompAgent is a divide-and-conquer methodology. Given a complex text prompt containing multiple concepts, including objects, attributes, and relationships, the LLM agent first decomposes it: it extracts the individual objects and their associated attributes, and predicts a coherent scene layout. These individual objects can then be conquered independently. Subsequently, the agent reasons over the text, then plans and employs tools to compose the isolated objects. Finally, a verification and human-feedback mechanism is incorporated into the agent to correct potential attribute errors and refine the generated images. Guided by the LLM agent, we propose a tuning-free multi-concept customization model and a layout-to-image generation model as the tools for concept composition, and a local image editing method as the tool that interacts with the agent for verification. Among these tools, the scene layout controls the image generation process to prevent confusion among multiple objects. Extensive experiments demonstrate the superiority of our approach for compositional text-to-image generation: CompAgent achieves more than a 10% improvement on T2I-CompBench, a comprehensive benchmark for open-world compositional T2I generation. Extensions to various related tasks also illustrate the flexibility of CompAgent for potential applications.
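To make the pipeline concrete, below is a minimal Python sketch of the divide-and-conquer agent loop the abstract describes: decompose the prompt, generate each object independently, compose them under a predicted layout, then verify and locally correct. Every function name (`llm_decompose`, `generate_single_object`, `compose_with_layout`, `verify_attributes`, `edit_locally`) and the stubbed tool behavior are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the CompAgent divide-and-conquer loop.
# All tool functions are stubs standing in for the models the paper
# names (multi-concept customization, layout-to-image, local editing).

from dataclasses import dataclass, field


@dataclass
class SceneObject:
    name: str
    attributes: list[str] = field(default_factory=list)
    bbox: tuple[int, int, int, int] = (0, 0, 0, 0)  # predicted layout box


def llm_decompose(prompt: str) -> list[SceneObject]:
    """Divide: an LLM extracts objects, attributes, and a scene layout.
    Stubbed with a fixed example for illustration."""
    return [
        SceneObject("cat", ["black"], (10, 40, 60, 90)),
        SceneObject("ball", ["red"], (70, 60, 95, 85)),
    ]


def generate_single_object(obj: SceneObject) -> str:
    """Conquer: generate each object independently (e.g. via a
    tuning-free customization model). Returns an image handle."""
    return f"img({' '.join(obj.attributes)} {obj.name})"


def compose_with_layout(images: list[str], objects: list[SceneObject]) -> str:
    """Compose: a layout-to-image tool places each object in its box,
    so the layout prevents confusion among multiple objects."""
    placed = ", ".join(f"{img}@{o.bbox}" for img, o in zip(images, objects))
    return f"scene[{placed}]"


def verify_attributes(scene: str, objects: list[SceneObject]) -> list[SceneObject]:
    """Verify: return objects whose attributes look wrong; empty if OK."""
    return []  # stub: assume the composition succeeded


def edit_locally(scene: str, wrong: list[SceneObject]) -> str:
    """Correct: a local image editing tool repairs the flagged regions."""
    return scene


def compagent(prompt: str, max_rounds: int = 3) -> str:
    objects = llm_decompose(prompt)                         # divide
    images = [generate_single_object(o) for o in objects]   # conquer
    scene = compose_with_layout(images, objects)            # compose
    for _ in range(max_rounds):                             # verify & correct
        wrong = verify_attributes(scene, objects)
        if not wrong:
            break
        scene = edit_locally(scene, wrong)
    return scene


print(compagent("a black cat playing with a red ball"))
```

The key design choice the sketch mirrors is that the layout, not the diffusion model alone, decides where each object lands, which is how the method keeps attributes bound to the right object when several objects share the scene.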