ChatPaper.ai

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

January 22, 2024
Authors: Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, Bin Cui
cs.AI

Abstract

Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating complex images into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate that our RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, our RPG framework exhibits wide compatibility with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available at: https://github.com/YangLing0818/RPG-DiffusionMaster
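To make the planning-then-regional-composition idea concrete, here is a minimal, illustrative sketch. It is not the authors' implementation: `plan_regions` and `compose_latents` are hypothetical helpers, the planner is a trivial equal-column splitter standing in for the MLLM layout planner, and "latents" are plain NumPy arrays. The sketch only shows the core property of complementary regional diffusion as described in the abstract: non-overlapping regions, each denoised against its own subprompt, are merged so every location is governed by exactly one subregion's result.

```python
import numpy as np

def plan_regions(subprompts, width=64, height=64):
    """Assign each subprompt an equal-width vertical column of the canvas.
    (A stand-in for RPG's MLLM planner, which produces the actual layout.)"""
    n = len(subprompts)
    step = width // n
    regions = []
    for i, sp in enumerate(subprompts):
        x0 = i * step
        x1 = width if i == n - 1 else (i + 1) * step  # last column absorbs the remainder
        regions.append({"prompt": sp, "box": (x0, 0, x1, height)})
    return regions

def compose_latents(region_latents, regions, width=64, height=64, channels=4):
    """Merge per-region latents under complementary (non-overlapping) masks:
    each spatial location is copied from exactly one region's latent."""
    out = np.zeros((channels, height, width), dtype=np.float32)
    for latent, r in zip(region_latents, regions):
        x0, y0, x1, y1 = r["box"]
        out[:, y0:y1, x0:x1] = latent[:, y0:y1, x0:x1]
    return out

# Usage: three subprompts (e.g., produced by recaptioning a complex prompt),
# each "denoised" here as a constant dummy latent for illustration.
subprompts = ["a red apple", "a green pear", "a blue vase"]
regions = plan_regions(subprompts)
latents = [np.full((4, 64, 64), float(i + 1), dtype=np.float32)
           for i in range(len(subprompts))]
composed = compose_latents(latents, regions)
```

In the actual framework each region would be denoised by the diffusion backbone conditioned on its own recaptioned subprompt before composition; the sketch replaces that step with constant arrays to keep the structure visible.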