

Diffusion Model Alignment Using Direct Preference Optimization

November 21, 2023
Authors: Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik
cs.AI

Abstract

Large language models (LLMs) are fine-tuned on human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to better align them with users' preferences. In contrast to LLMs, human preference learning has not been widely explored in text-to-image diffusion models; the best existing approach is to fine-tune a pretrained model on carefully curated, high-quality images and captions to improve visual appeal and text alignment. We propose Diffusion-DPO, a method to align diffusion models to human preferences by directly optimizing on human comparison data. Diffusion-DPO is adapted from the recently developed Direct Preference Optimization (DPO), a simpler alternative to RLHF that directly optimizes a policy to best satisfy human preferences under a classification objective. We re-formulate DPO to account for a diffusion-model notion of likelihood, using the evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. In human evaluation, our fine-tuned base model significantly outperforms both the SDXL-1.0 base model and the larger SDXL-1.0 pipeline that adds a refinement model, improving both visual appeal and prompt alignment. We also develop a variant that uses AI feedback and performs comparably to training on human preferences, opening the door to scaling diffusion model alignment methods.
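To make the re-formulated objective more concrete, below is a minimal PyTorch sketch of a Diffusion-DPO-style pairwise loss under the usual noise-prediction (epsilon) parameterization. The function name, argument names, and the default `beta` are illustrative assumptions rather than the paper's released code, and the paper's exact per-timestep weighting may differ.

```python
# Minimal sketch of a Diffusion-DPO-style loss (noise-prediction parameterization).
# All names here are illustrative; prompt conditioning is omitted for brevity.
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(model, ref_model, x_t_w, x_t_l, t, noise_w, noise_l, beta=5000.0):
    """Preference loss on a (preferred, rejected) pair of noised latents at timestep t.

    x_t_w, x_t_l: noised latents of the preferred / rejected images (B, C, H, W)
    noise_w, noise_l: the Gaussian noise that was added to each latent
    """
    # Per-sample denoising errors of the model being fine-tuned.
    err_w = F.mse_loss(model(x_t_w, t), noise_w, reduction="none").mean(dim=(1, 2, 3))
    err_l = F.mse_loss(model(x_t_l, t), noise_l, reduction="none").mean(dim=(1, 2, 3))

    # Per-sample denoising errors of the frozen reference model (e.g. the pretrained base).
    with torch.no_grad():
        ref_err_w = F.mse_loss(ref_model(x_t_w, t), noise_w, reduction="none").mean(dim=(1, 2, 3))
        ref_err_l = F.mse_loss(ref_model(x_t_l, t), noise_l, reduction="none").mean(dim=(1, 2, 3))

    # Implicit-reward gap: how much more the fine-tuned model improves over the
    # reference on the preferred sample than on the rejected one.
    diff = (err_w - ref_err_w) - (err_l - ref_err_l)

    # DPO-style logistic (classification) objective; beta controls how strongly
    # the fine-tuned model is kept close to the reference.
    return -F.logsigmoid(-beta * diff).mean()
```

The log-sigmoid term plays the role of DPO's classification objective: the fine-tuned model is rewarded for denoising the preferred image better, relative to the frozen reference, than the rejected one, which is how the evidence lower bound stands in for an exact diffusion likelihood.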