IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation

Abstract

Advanced diffusion models such as RPG, Stable Diffusion 3, and FLUX have made notable strides in compositional text-to-image generation. However, these methods typically exhibit distinct strengths in compositional generation: some excel at attribute binding, others at spatial relationships. This disparity highlights the need for an approach that leverages the complementary strengths of various models to comprehensively improve compositional capability. To this end, we introduce IterComp, a novel framework that aggregates composition-aware model preferences from multiple models and employs an iterative feedback learning approach to enhance compositional generation. Specifically, we curate a gallery of six powerful open-source diffusion models and evaluate them on three key compositional metrics: attribute binding, spatial relationships, and non-spatial relationships. Based on these metrics, we develop a composition-aware model preference dataset comprising numerous image-rank pairs to train composition-aware reward models. We then propose an iterative feedback learning method that enhances compositionality in a closed-loop manner, enabling the progressive self-refinement of both the base diffusion model and the reward models over multiple iterations. A theoretical proof supports the method's effectiveness, and extensive experiments show significant superiority over previous SOTA methods (e.g., Omost and FLUX), particularly in multi-category object composition and complex semantic alignment. IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation. Code: https://github.com/YangLing0818/IterComp
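
As a rough illustration of how image-rank pairs could supervise a reward model, the sketch below expands one ranking of gallery outputs into (preferred, rejected) pairs and scores them with a Bradley-Terry style pairwise loss. The abstract does not state the loss that IterComp uses, so the loss choice, the function names (`rank_to_pairs`, `pairwise_loss`, `dataset_loss`), the file names, and the scores are all assumptions made for illustration only.

```python
import math
from itertools import combinations


def rank_to_pairs(ranked_images):
    """Expand one ranking (best first) into (preferred, rejected) pairs.

    combinations() preserves input order, so the first element of each
    pair is always the higher-ranked image.
    """
    return list(combinations(ranked_images, 2))


def pairwise_loss(score_win, score_lose):
    """Bradley-Terry style loss: -log(sigmoid(score_win - score_lose)).

    Written as a numerically stable softplus of the negated margin.
    This loss is an assumed choice, not one confirmed by the abstract.
    """
    diff = score_win - score_lose
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def dataset_loss(pairs, scores):
    """Mean pairwise loss over all pairs, given a dict of image scores."""
    total = sum(pairwise_loss(scores[w], scores[l]) for w, l in pairs)
    return total / len(pairs)


# Hypothetical ranking of three gallery outputs for one prompt (best first).
ranked = ["model_a.png", "model_b.png", "model_c.png"]
pairs = rank_to_pairs(ranked)

# Hypothetical reward-model scores: one set agrees with the ranking, one reverses it.
agreeing = {"model_a.png": 2.0, "model_b.png": 1.0, "model_c.png": 0.0}
reversed_scores = {"model_a.png": 0.0, "model_b.png": 1.0, "model_c.png": 2.0}

loss_agreeing = dataset_loss(pairs, agreeing)
loss_reversed = dataset_loss(pairs, reversed_scores)
```

Under these assumptions, scores that agree with the ranking yield a lower loss than scores that reverse it, which is the signal a reward model would be trained to minimize.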
