SePPO: Semi-Policy Preference Optimization for Diffusion Alignment

October 7, 2024
Authors: Daoan Zhang, Guangchen Lan, Dong-Jun Han, Wenlin Yao, Xiaoman Pan, Hongming Zhang, Mingxiao Li, Pengcheng Chen, Yu Dong, Christopher Brinton, Jiebo Luo
cs.AI

Abstract

Reinforcement learning from human feedback (RLHF) methods are emerging as a way to fine-tune diffusion models (DMs) for visual generation. However, commonly used on-policy strategies are limited by the generalization capability of the reward model, while off-policy approaches require large amounts of difficult-to-obtain paired human-annotated data, particularly in visual generation tasks. To address the limitations of both on- and off-policy RLHF, we propose a preference optimization method that aligns DMs with preferences without relying on reward models or paired human-annotated data. Specifically, we introduce a Semi-Policy Preference Optimization (SePPO) method. SePPO leverages previous checkpoints as reference models while using them to generate on-policy reference samples, which replace "losing images" in preference pairs. This approach allows us to optimize using only off-policy "winning images." Furthermore, we design a strategy for reference model selection that expands the exploration in the policy space. Notably, we do not simply treat reference samples as negative examples for learning. Instead, we design an anchor-based criterion to assess whether the reference samples are likely to be winning or losing images, allowing the model to selectively learn from the generated reference samples. This approach mitigates performance degradation caused by the uncertainty in reference sample quality. We validate SePPO across both text-to-image and text-to-video benchmarks. SePPO surpasses all previous approaches on the text-to-image benchmarks and also demonstrates outstanding performance on the text-to-video benchmarks. Code will be released at https://github.com/DwanZhang-AI/SePPO.
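
To make the mechanism described above concrete, the sketch below shows what one SePPO-style update step could look like in PyTorch, assuming DPO-style implicit rewards and a simple log-ratio anchor. The function name seppo_step, its arguments, and the exact forms of the anchor criterion and loss are illustrative assumptions rather than the authors' implementation; see the released code for the actual method.

import torch
import torch.nn.functional as F

def seppo_step(logp_policy_win, logp_ref_win,
               logp_policy_gen, logp_ref_gen, beta=0.1):
    """One hypothetical SePPO-style preference update.

    logp_policy_win / logp_ref_win: per-sample log-likelihood of the
        off-policy "winning image" under the current policy and under the
        reference checkpoint.
    logp_policy_gen / logp_ref_gen: the same quantities for the sample
        generated by the reference checkpoint (the stand-in "loser").
    """
    # DPO-style implicit rewards: scaled log-ratio of policy vs. reference.
    r_win = beta * (logp_policy_win - logp_ref_win)
    r_gen = beta * (logp_policy_gen - logp_ref_gen)

    # Anchor-based criterion (assumed form): if the winning image is still
    # preferred over the generated sample, treat the generated sample as a
    # "losing image"; otherwise learn from it as if it were a winner.
    gen_is_loser = (r_win >= r_gen).detach()

    loss_if_loser = -F.logsigmoid(r_win - r_gen)   # standard preference loss
    loss_if_winner = -F.logsigmoid(r_gen)          # reinforce the generated sample
    return torch.where(gen_is_loser, loss_if_loser, loss_if_winner).mean()

# Toy usage: per-sample log-likelihoods for a batch of two prompts.
loss = seppo_step(
    logp_policy_win=torch.tensor([-3.0, -2.5]),
    logp_ref_win=torch.tensor([-3.2, -2.6]),
    logp_policy_gen=torch.tensor([-4.0, -2.0]),
    logp_ref_gen=torch.tensor([-3.9, -2.1]),
)
print(loss.item())

The design point this sketch illustrates is that no reward model is needed: the anchor check reuses the same policy-versus-reference log-ratios that the preference loss already computes, so each generated reference sample is either penalized as a stand-in "losing image" or reinforced when it already beats the anchor.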
