Diffusion Model Alignment Using Direct Preference Optimization
November 21, 2023
Authors: Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik
cs.AI
Abstract
Large language models (LLMs) are fine-tuned using human comparison data with
Reinforcement Learning from Human Feedback (RLHF) methods to make them better
aligned with users' preferences. In contrast to LLMs, human preference learning
has not been widely explored in text-to-image diffusion models; the best
existing approach is to fine-tune a pretrained model using carefully curated
high quality images and captions to improve visual appeal and text alignment.
We propose Diffusion-DPO, a method to align diffusion models to human
preferences by directly optimizing on human comparison data. Diffusion-DPO is
adapted from the recently developed Direct Preference Optimization (DPO), a
simpler alternative to RLHF which directly optimizes a policy that best
satisfies human preferences under a classification objective. We re-formulate
DPO to account for a diffusion model notion of likelihood, utilizing the
evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic
dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model
of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with
Diffusion-DPO. Our fine-tuned base model significantly outperforms both base
SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement
model in human evaluation, improving visual appeal and prompt alignment. We
also develop a variant that uses AI feedback and has comparable performance to
training on human preferences, opening the door for scaling of diffusion model
alignment methods.
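
Concretely, the reformulated objective reduces to a pairwise classification loss on denoising errors: for a preferred/rejected image pair sharing a prompt, the fine-tuned model is pushed to lower its noise-prediction error on the preferred image more than on the rejected one, relative to a frozen reference copy of the pretrained model. The sketch below shows one plausible training step under that reading; the diffusers-style `scheduler`, the `net(x_t, t, cond)` calling convention, and the constant `beta` (which here absorbs the per-timestep weighting from the full derivation) are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal, illustrative sketch of a Diffusion-DPO-style training step in PyTorch.
# Assumptions (not from the abstract): a diffusers-style `scheduler` exposing
# `add_noise` and `config.num_train_timesteps`, noise-prediction networks called as
# `net(x_t, t, cond)`, and the per-timestep weighting folded into `beta`.
import torch
import torch.nn.functional as F


def diffusion_dpo_loss(policy, reference, x_w, x_l, cond, scheduler, beta=5000.0):
    """Pairwise preference loss: x_w is the preferred latent, x_l the rejected one,
    both generated for the same prompt encoded in `cond`; `reference` is frozen."""
    b = x_w.shape[0]
    t = torch.randint(0, scheduler.config.num_train_timesteps, (b,), device=x_w.device)
    noise = torch.randn_like(x_w)

    # Apply the same forward-diffusion noise and timestep to both images of the pair,
    # so the ELBO-based denoising errors are directly comparable.
    xt_w = scheduler.add_noise(x_w, noise, t)
    xt_l = scheduler.add_noise(x_l, noise, t)

    def denoise_err(net, x_t):
        # Per-sample squared error of the predicted noise.
        return (net(x_t, t, cond) - noise).pow(2).mean(dim=(1, 2, 3))

    err_w = denoise_err(policy, xt_w)
    err_l = denoise_err(policy, xt_l)
    with torch.no_grad():
        ref_err_w = denoise_err(reference, xt_w)
        ref_err_l = denoise_err(reference, xt_l)

    # The policy should reduce the denoising error on the preferred sample more than
    # on the rejected one, relative to the frozen reference model.
    margin = (ref_err_w - err_w) - (ref_err_l - err_l)
    return -F.logsigmoid(beta * margin).mean()
```

Sharing the noise and timestep across both members of the pair keeps the comparison low-variance, and the frozen reference terms play the role that the KL regularizer plays in RLHF-style fine-tuning.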