Diffusion Model Alignment Using Direct Preference Optimization

Abstract

Large language models (LLMs) are fine-tuned on human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to align them better with users' preferences. In contrast, human preference learning has not been widely explored for text-to-image diffusion models; the best existing approach is to fine-tune a pretrained model on carefully curated high-quality images and captions to improve visual appeal and text alignment. We propose Diffusion-DPO, a method that aligns diffusion models to human preferences by directly optimizing on human comparison data. Diffusion-DPO is adapted from the recently developed Direct Preference Optimization (DPO), a simpler alternative to RLHF that directly optimizes a policy to best satisfy human preferences under a classification objective. We reformulate DPO to account for a diffusion-model notion of likelihood, using the evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 with Diffusion-DPO. In human evaluation, our fine-tuned base model significantly outperforms both base SDXL-1.0 and the larger SDXL-1.0 pipeline that adds a refinement model, improving visual appeal and prompt alignment. We also develop a variant that uses AI feedback and performs comparably to training on human preferences, opening the door to scaling diffusion model alignment methods.
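
The abstract does not spell out the objective, so the following is only a minimal sketch of how a DPO-style classification loss might look once log-likelihoods are replaced by denoising errors, as the evidence-lower-bound argument suggests. The function name `diffusion_dpo_loss`, its argument names, and the choice to fold any per-timestep weighting into a single `beta` are assumptions made for illustration, not details taken from the text above.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def diffusion_dpo_loss(err_w_model: float, err_w_ref: float,
                       err_l_model: float, err_l_ref: float,
                       beta: float = 1.0) -> float:
    """Preference loss for one (preferred, rejected) image pair at one timestep.

    Each err_* is a squared noise-prediction error on a noised image:
    "w" is the human-preferred image, "l" the rejected one; "model" is the
    network being fine-tuned, "ref" the frozen pretrained reference.
    Assumption: beta absorbs the regularization strength and any
    timestep-dependent weighting.
    """
    # How much better than the reference the model denoises each image.
    gain_w = err_w_ref - err_w_model
    gain_l = err_l_ref - err_l_model
    # Binary classification: the preferred image should gain more.
    return -log_sigmoid(beta * (gain_w - gain_l))
```

When the fine-tuned model matches the reference on both images, the loss equals log 2. It falls below that value when the model improves its denoising of the preferred image relative to the rejected one, and rises above it in the opposite case.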
