Region-Aware Bimodal Direct Preference Optimization for Compositional Text-to-Image Generation

Abstract

Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex compositional prompts (covering attribute binding, object relationships, and counting) remains challenging. To address this, we propose BiDPO, a framework that enhances the compositional generation capability of T2I models. We first introduce a carefully designed pipeline that constructs a large-scale preference dataset, BiComp, under strict quality control. We then extend Diffusion DPO to jointly optimize image and text preferences, which proves highly effective in improving the model's ability to follow complex text prompts. To further strengthen fine-grained alignment, we employ a region-level guidance method that focuses on the regions relevant to compositional concepts. Experimental results demonstrate that BiDPO substantially improves compositional fidelity, consistently outperforming prior methods across multiple benchmarks. Our approach highlights the potential of preference-based fine-tuning for complex text-to-image tasks, offering a flexible and scalable alternative to existing techniques.
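
To make the idea of jointly optimizing image and text preferences concrete, the following is a minimal sketch and not the BiDPO objective itself, which the abstract does not specify. It assumes the standard Diffusion-DPO form, in which each preference pair contributes a negative log-sigmoid of a margin between denoising errors of the trained and reference models, and it assumes that the image-side term (same prompt, preferred vs. rejected image) and the text-side term (same image, preferred vs. rejected prompt) are simply summed. The names `dpo_term`, `bimodal_loss`, and `text_weight` are hypothetical, and the timestep weighting is folded into `beta`.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_term(err_win, err_win_ref, err_lose, err_lose_ref, beta=1.0):
    """Diffusion-DPO-style loss for one preference pair.

    Each err_* is a scalar denoising error (e.g. a squared
    noise-prediction error) for the preferred ("win") or rejected
    ("lose") sample, under the trained model or the frozen
    reference model ("_ref").
    """
    # Negative margin means the trained model improved more on the
    # preferred sample than on the rejected one, relative to the reference.
    margin = (err_win - err_win_ref) - (err_lose - err_lose_ref)
    return -log_sigmoid(-beta * margin)


def bimodal_loss(image_pair, text_pair, beta=1.0, text_weight=1.0):
    """Assumed combination: image-preference term plus weighted
    text-preference term. Each pair is a 4-tuple
    (err_win, err_win_ref, err_lose, err_lose_ref)."""
    image_term = dpo_term(*image_pair, beta=beta)
    text_term = dpo_term(*text_pair, beta=beta)
    return image_term + text_weight * text_term


if __name__ == "__main__":
    # Toy numbers: the trained model denoises the preferred image
    # and the preferred prompt pairing better than the reference does.
    image_pair = (0.5, 1.0, 1.0, 1.0)
    text_pair = (0.8, 1.0, 1.0, 1.0)
    print(bimodal_loss(image_pair, text_pair))
```

When the trained model matches the reference on every sample, both margins are zero and each term equals log 2; the loss falls below that value as the model lowers its error on preferred samples relative to rejected ones.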