Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs

Abstract

Recent advances in diffusion-based text-to-image (T2I) models have made it possible to generate high-quality images from text prompts. However, ensuring accurate alignment between the text and the generated image remains a significant challenge for state-of-the-art diffusion models. To address this, existing studies use reinforcement learning from human feedback (RLHF) to align T2I outputs with human preferences. These methods either rely directly on paired image preference data or require a learned reward function. Both depend heavily on costly, high-quality human annotations, which limits their scalability. In this work, we introduce Text Preference Optimization (TPO), a framework that enables "free-lunch" alignment of T2I models, that is, alignment without paired image preference data. TPO trains the model to prefer matched prompts over mismatched prompts, where the mismatched prompts are constructed by perturbing the original captions with a large language model. The framework is general and compatible with existing preference-based algorithms. We extend both DPO and KTO to this setting, yielding TDPO and TKTO. Quantitative and qualitative evaluations across multiple benchmarks show that our methods consistently outperform their original counterparts, achieving better human preference scores and improved text-to-image alignment. Our open-source code is available at https://github.com/DSL-Lab/T2I-Free-Lunch-Alignment.
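The abstract does not state the training objective, so the following is only an illustrative sketch of what "preferring matched prompts over mismatched prompts" could look like. It assumes a DPO-style logistic loss computed from per-sample denoising errors, by analogy with how DPO has been adapted to diffusion models. The function name `text_preference_loss`, its parameters, and the choice of denoising error as the preference signal are all hypothetical and are not taken from the paper or its repository.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def text_preference_loss(
    err_matched: float,
    err_mismatched: float,
    ref_err_matched: float,
    ref_err_mismatched: float,
    beta: float = 1.0,
) -> float:
    """Hypothetical DPO-style loss over a (matched, mismatched) prompt pair.

    Each `err_*` is assumed to be a denoising error for the SAME noisy image,
    conditioned on either the matched or the mismatched prompt, under the
    trained model (`err_*`) or a frozen reference model (`ref_err_*`).
    This is an assumption for illustration, not the paper's formula.
    """
    # How much the trained model improves on the reference for each prompt
    # (negative values mean lower error than the reference).
    matched_gain = err_matched - ref_err_matched
    mismatched_gain = err_mismatched - ref_err_mismatched
    # The loss falls when the model fits the matched prompt better and the
    # mismatched prompt worse, relative to the reference.
    margin = -beta * (matched_gain - mismatched_gain)
    return -log_sigmoid(margin)


if __name__ == "__main__":
    neutral = text_preference_loss(0.5, 0.5, 0.5, 0.5)
    aligned = text_preference_loss(0.2, 0.8, 0.5, 0.5)
    print(f"neutral={neutral:.4f} aligned={aligned:.4f}")
```

Under these assumptions, a model identical to the reference incurs a loss of log 2, and the loss decreases as the model's error drops for the matched prompt and rises for the mismatched one. No image preference pair appears anywhere in the computation, which is the property the abstract emphasizes.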
