Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Abstract

Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often struggle with complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a new training-free text-to-image generation/editing framework, Recaption, Plan and Generate (RPG), which harnesses the powerful chain-of-thought reasoning ability of multimodal LLMs (MLLMs) to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating a complex image into multiple simpler generation tasks within subregions. We also propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate that RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, the RPG framework exhibits wide compatibility with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available at: https://github.com/YangLing0818/RPG-DiffusionMaster
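To make the idea of region-wise composition concrete, here is a minimal sketch of what decomposing a canvas into subregions and recombining per-region results could look like. It is not the RPG implementation: the equal-width vertical split stands in for the MLLM planner, the per-region arrays stand in for denoised latents, and the weighted blend with a base array (the `base_ratio` parameter) is an assumption that the abstract does not confirm. All names (`Region`, `plan_vertical_split`, `compose`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    prompt: str  # sub-prompt a planner might assign to this region
    x0: int      # first column of the region (inclusive)
    x1: int      # last column of the region (exclusive)


def plan_vertical_split(subprompts, width):
    """Hypothetical planner stand-in: one equal-width vertical strip per sub-prompt."""
    n = len(subprompts)
    bounds = [i * width // n for i in range(n + 1)]
    return [Region(p, bounds[i], bounds[i + 1]) for i, p in enumerate(subprompts)]


def compose(base, regional, regions, base_ratio):
    """Blend a base array with per-region arrays.

    Inside region k the output is
        base_ratio * base + (1 - base_ratio) * regional[k]
    (an assumed blending rule, used here only for illustration).
    """
    height, width = len(base), len(base[0])
    out = [[0.0] * width for _ in range(height)]
    for k, region in enumerate(regions):
        for y in range(height):
            for x in range(region.x0, region.x1):
                out[y][x] = (base_ratio * base[y][x]
                             + (1.0 - base_ratio) * regional[k][y][x])
    return out


# Toy usage: a 2x4 "latent" split into two regions.
regions = plan_vertical_split(["a red apple", "a blue cup"], width=4)
base = [[0.0] * 4 for _ in range(2)]
regional = [
    [[1.0] * 4 for _ in range(2)],  # stand-in result for the first sub-prompt
    [[2.0] * 4 for _ in range(2)],  # stand-in result for the second sub-prompt
]
blended = compose(base, regional, regions, base_ratio=0.5)
```

In this toy setup each region only contributes within its own column range, so the sub-prompts do not overwrite one another; a real pipeline would replace the stand-in arrays with latents produced by a diffusion backbone.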
