Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

January 22, 2024
Authors: Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, Bin Cui
cs.AI

Abstract

Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often struggle with complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a new training-free text-to-image generation/editing framework, Recaption, Plan and Generate (RPG), which harnesses the chain-of-thought reasoning ability of multimodal LLMs (MLLMs) to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating a complex image into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate that RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, the RPG framework is compatible with a wide range of MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available at: https://github.com/YangLing0818/RPG-DiffusionMaster
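
To make the idea of planning subregions and composing region-wise results more concrete, here is a toy sketch in plain Python. It is not the authors' implementation: the names Region, plan_regions, and compose, the side-by-side column layout, and the blend with a base latent are all assumptions made for illustration, and the "latents" are simple nested lists rather than real diffusion tensors.

```python
from dataclasses import dataclass


@dataclass
class Region:
    prompt: str  # subprompt assigned to this region (hypothetical planner output)
    x0: int      # inclusive left column
    x1: int      # exclusive right column


def plan_regions(subprompts, weights, width):
    """Split `width` columns into side-by-side regions proportional to `weights`."""
    if not subprompts or len(subprompts) != len(weights):
        raise ValueError("need exactly one weight per subprompt")
    total = sum(weights)
    regions, start, acc = [], 0, 0
    for prompt, weight in zip(subprompts, weights):
        acc += weight
        end = round(width * acc / total)  # cumulative split avoids drift
        regions.append(Region(prompt, start, end))
        start = end
    return regions


def compose(regional, regions, base, base_weight):
    """Stitch full-canvas per-region latents by column, then blend with a base latent.

    `regional[i]` is a latent generated for `regions[i].prompt`; only the
    columns inside that region are kept. `base` is a latent for the whole
    prompt, mixed in with weight `base_weight` (an assumed knob).
    """
    height, width = len(base), len(base[0])
    out = [[0.0] * width for _ in range(height)]
    for latent, region in zip(regional, regions):
        for y in range(height):
            for x in range(region.x0, region.x1):
                out[y][x] = (1 - base_weight) * latent[y][x] + base_weight * base[y][x]
    return out
```

For example, plan_regions(["a red cat", "a blue dog"], [1, 1], 8) assigns columns 0 to 3 to the first subprompt and columns 4 to 7 to the second. In a real pipeline the planner output would come from an MLLM and the composition would be applied to latents at each denoising step; this sketch only shows the bookkeeping.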