Diffusion Model Alignment Using Direct Preference Optimization
November 21, 2023
Authors: Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik
cs.AI
Abstract
Large language models (LLMs) are fine-tuned using human comparison data with
Reinforcement Learning from Human Feedback (RLHF) methods to make them better
aligned with users' preferences. In contrast to LLMs, human preference learning
has not been widely explored in text-to-image diffusion models; the best
existing approach is to fine-tune a pretrained model using carefully curated
high quality images and captions to improve visual appeal and text alignment.
We propose Diffusion-DPO, a method to align diffusion models to human
preferences by directly optimizing on human comparison data. Diffusion-DPO is
adapted from the recently developed Direct Preference Optimization (DPO), a
simpler alternative to RLHF which directly optimizes a policy that best
satisfies human preferences under a classification objective. We re-formulate
DPO to account for a diffusion model notion of likelihood, utilizing the
evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic
dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model
of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with
Diffusion-DPO. Our fine-tuned base model significantly outperforms both base
SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement
model in human evaluation, improving visual appeal and prompt alignment. We
also develop a variant that uses AI feedback and has comparable performance to
training on human preferences, opening the door for scaling of diffusion model
alignment methods.
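
The abstract does not spell out the final objective, but its description (a DPO-style classification loss in which the intractable diffusion likelihood is replaced by its evidence lower bound) suggests a pairwise loss on denoising errors measured against a frozen reference model. The following is a minimal PyTorch sketch under that reading; the function name, the folding of any timestep-dependent weighting into a single `beta` constant, and the tensor conventions are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def diffusion_dpo_style_loss(eps_theta_w, eps_theta_l, eps_ref_w, eps_ref_l,
                             noise_w, noise_l, beta=5000.0):
    """Illustrative DPO-style pairwise loss on denoising errors.

    eps_theta_*: noise predictions of the model being fine-tuned, on the
                 noised latents of the preferred (w) and dispreferred (l) images.
    eps_ref_*:   noise predictions of the frozen reference (pretrained) model
                 on the same noised latents.
    noise_*:     the true noise that was added to each latent.
    beta:        regularization strength toward the reference model
                 (assumed here to absorb any timestep weighting).
    """
    # Per-sample squared denoising error, summed over all non-batch dims.
    def err(pred, target):
        return ((pred - target) ** 2).flatten(1).sum(-1)

    # How much the fine-tuned model's error gap (preferred minus dispreferred)
    # improves over the reference model's error gap.
    model_gap = err(eps_theta_w, noise_w) - err(eps_theta_l, noise_l)
    ref_gap = err(eps_ref_w, noise_w) - err(eps_ref_l, noise_l)

    # Logistic (classification-style) objective, as in DPO: reward lowering
    # the denoising error on preferred images relative to the reference.
    return -F.logsigmoid(-beta * (model_gap - ref_gap)).mean()
```

In practice, the preferred/dispreferred pairs would presumably come from a preference dataset such as Pick-a-Pic, with each comparison sharing a prompt, timestep, and noise draw, and the reference model kept frozen at the pretrained checkpoint.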