PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
February 13, 2024
Authors: Fei Deng, Qifei Wang, Wei Wei, Matthias Grundmann, Tingbo Hou
cs.AI
Abstract
Reward finetuning has emerged as a promising approach to aligning foundation
models with downstream objectives. Remarkable success has been achieved in the
language domain by using reinforcement learning (RL) to maximize rewards that
reflect human preference. However, in the vision domain, existing RL-based
reward finetuning methods are limited by their instability in large-scale
training, rendering them incapable of generalizing to complex, unseen prompts.
In this paper, we propose Proximal Reward Difference Prediction (PRDP),
enabling stable black-box reward finetuning for diffusion models for the first
time on large-scale prompt datasets with over 100K prompts. Our key innovation
is the Reward Difference Prediction (RDP) objective that has the same optimal
solution as the RL objective while enjoying better training stability.
Specifically, the RDP objective is a supervised regression objective that tasks
the diffusion model with predicting the reward difference of generated image
pairs from their denoising trajectories. We theoretically prove that the
diffusion model that obtains perfect reward difference prediction is exactly
the maximizer of the RL objective. We further develop an online algorithm with
proximal updates to stably optimize the RDP objective. In experiments, we
demonstrate that PRDP can match the reward maximization ability of
well-established RL-based methods in small-scale training. Furthermore, through
large-scale training on text prompts from the Human Preference Dataset v2 and
the Pick-a-Pic v1 dataset, PRDP achieves superior generation quality on a
diverse set of complex, unseen prompts whereas RL-based methods completely
fail.
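To make the RDP objective more concrete, below is a minimal PyTorch-style sketch of the supervised regression loss described in the abstract. Everything beyond what the abstract states is an assumption for illustration: the function name `rdp_loss`, the parameterization of the predicted reward difference as a scaled difference of per-trajectory log-likelihood ratios against a frozen reference model, the scale `beta`, and the clamp-based proximal scheme with range `eps`. This is a sketch of the idea, not the paper's exact implementation.

```python
import torch

def rdp_loss(log_prob_a, log_prob_b,          # summed per-step log-probs of each denoising trajectory
             ref_log_prob_a, ref_log_prob_b,  # same quantities under a frozen reference model (assumption)
             reward_a, reward_b,              # black-box rewards of the two generated images
             beta=1.0, eps=0.1):              # hypothetical scale and proximal range
    """Sketch of a Reward Difference Prediction (RDP) regression loss.

    Assumption: the model's reward-difference prediction is parameterized as
    beta times the difference of the two trajectories' log-likelihood ratios
    against a reference model, and the proximal update keeps those ratios
    within a small range.
    """
    # Per-sample log-likelihood ratios against the reference model.
    ratio_a = log_prob_a - ref_log_prob_a
    ratio_b = log_prob_b - ref_log_prob_b

    # Proximal update (assumption): constrain the ratios to stay near the reference.
    ratio_a = torch.clamp(ratio_a, -eps, eps)
    ratio_b = torch.clamp(ratio_b, -eps, eps)

    # Predicted vs. actual reward difference, trained with plain MSE regression.
    predicted_diff = beta * (ratio_a - ratio_b)
    target_diff = reward_a - reward_b
    return torch.mean((predicted_diff - target_diff) ** 2)
```

Consistent with the online algorithm the abstract mentions, a loss of this form would be recomputed on image pairs freshly sampled from the current diffusion model at every update, rather than on a fixed offline dataset.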