

Diffusion Model Alignment Using Direct Preference Optimization

November 21, 2023
Authors: Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik
cs.AI

Abstract

Large language models (LLMs) are fine-tuned using human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to make them better aligned with users' preferences. In contrast to LLMs, human preference learning has not been widely explored in text-to-image diffusion models; the best existing approach is to fine-tune a pretrained model using carefully curated high quality images and captions to improve visual appeal and text alignment. We propose Diffusion-DPO, a method to align diffusion models to human preferences by directly optimizing on human comparison data. Diffusion-DPO is adapted from the recently developed Direct Preference Optimization (DPO), a simpler alternative to RLHF which directly optimizes a policy that best satisfies human preferences under a classification objective. We re-formulate DPO to account for a diffusion model notion of likelihood, utilizing the evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. Our fine-tuned base model significantly outperforms both base SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement model in human evaluation, improving visual appeal and prompt alignment. We also develop a variant that uses AI feedback and has comparable performance to training on human preferences, opening the door for scaling of diffusion model alignment methods.
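The abstract describes the derived objective only at a high level. As a rough illustration, here is a minimal PyTorch-style sketch of a Diffusion-DPO-style pairwise loss. The function name, the tensor shapes, the choice to share one noise sample between the preferred and rejected latents, and the default beta are assumptions made for illustration, not details taken from the paper; pred_* are noise predictions from the model being fine-tuned, and ref_pred_* come from a frozen copy of the pretrained (reference) model.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(noise, pred_w, pred_l, ref_pred_w, ref_pred_l, beta=5000.0):
    """Sketch of a Diffusion-DPO-style preference loss (illustrative, not the paper's exact code).

    noise:        Gaussian noise added to both latents at the sampled timestep, shape (B, C, H, W)
                  (sharing one noise sample across the pair is an assumption of this sketch).
    pred_w/l:     noise predictions of the fine-tuned model on the preferred (w) / rejected (l) latents.
    ref_pred_w/l: noise predictions of the frozen reference model on the same latents.
    beta:         strength of the implicit regularization toward the reference model.
    """
    # Per-example squared denoising errors, averaged over channel and spatial dims.
    err_w = (noise - pred_w).pow(2).mean(dim=(1, 2, 3))
    err_l = (noise - pred_l).pow(2).mean(dim=(1, 2, 3))
    ref_err_w = (noise - ref_pred_w).pow(2).mean(dim=(1, 2, 3))
    ref_err_l = (noise - ref_pred_l).pow(2).mean(dim=(1, 2, 3))

    # How much the fine-tuned model out-denoises the reference on the preferred
    # sample, compared with the same gap on the rejected sample.
    diff = (err_w - ref_err_w) - (err_l - ref_err_l)

    # DPO-style logistic objective: the loss decreases when the fine-tuned model
    # improves on the preferred image (relative to the reference) more than on the rejected one.
    return -F.logsigmoid(-beta * diff).mean()
```

The key design point mirrors DPO itself: no explicit reward model is trained. The diffusion model is simply pushed to denoise the human-preferred image better, relative to the frozen reference, than it denoises the rejected one.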
