

PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models

February 13, 2024
Authors: Fei Deng, Qifei Wang, Wei Wei, Matthias Grundmann, Tingbo Hou
cs.AI

Abstract

Reward finetuning has emerged as a promising approach to aligning foundation models with downstream objectives. Remarkable success has been achieved in the language domain by using reinforcement learning (RL) to maximize rewards that reflect human preference. However, in the vision domain, existing RL-based reward finetuning methods are limited by their instability in large-scale training, rendering them incapable of generalizing to complex, unseen prompts. In this paper, we propose Proximal Reward Difference Prediction (PRDP), enabling stable black-box reward finetuning for diffusion models for the first time on large-scale prompt datasets with over 100K prompts. Our key innovation is the Reward Difference Prediction (RDP) objective that has the same optimal solution as the RL objective while enjoying better training stability. Specifically, the RDP objective is a supervised regression objective that tasks the diffusion model with predicting the reward difference of generated image pairs from their denoising trajectories. We theoretically prove that the diffusion model that obtains perfect reward difference prediction is exactly the maximizer of the RL objective. We further develop an online algorithm with proximal updates to stably optimize the RDP objective. In experiments, we demonstrate that PRDP can match the reward maximization ability of well-established RL-based methods in small-scale training. Furthermore, through large-scale training on text prompts from the Human Preference Dataset v2 and the Pick-a-Pic v1 dataset, PRDP achieves superior generation quality on a diverse set of complex, unseen prompts whereas RL-based methods completely fail.
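The abstract describes the RDP objective only at a high level: a supervised regression loss in which the diffusion model predicts the reward difference of an image pair from their denoising trajectories, optimized online with proximal updates. The snippet below is a minimal, illustrative sketch of what such a loss could look like, assuming a parameterization in which the predicted reward difference is derived from trajectory log-likelihood ratios against a frozen reference model; the names rdp_loss, beta, and clip_range, and the placement of the clipping, are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def rdp_loss(logp_theta_a, logp_ref_a, logp_theta_b, logp_ref_b,
             reward_a, reward_b, beta=1.0, clip_range=0.2):
    """Illustrative sketch of a Reward Difference Prediction (RDP) style loss.

    logp_theta_* : summed log-probabilities of the denoising trajectory for
                   images A/B under the finetuned diffusion model, shape [batch].
    logp_ref_*   : the same trajectories scored by the frozen reference model.
    reward_*     : scalar rewards of the generated images, shape [batch].
    beta         : temperature linking log-likelihood ratios to rewards (assumed).
    clip_range   : clipping range used as a stand-in for proximal updates (assumed).
    """
    # Log-likelihood ratios of each trajectory vs. the reference model.
    ratio_a = logp_theta_a - logp_ref_a
    ratio_b = logp_theta_b - logp_ref_b

    # Proximal-style update: clip the ratios so the finetuned model stays close
    # to the reference, analogous to PPO-style clipping (illustrative choice).
    ratio_a = torch.clamp(ratio_a, -clip_range, clip_range)
    ratio_b = torch.clamp(ratio_b, -clip_range, clip_range)

    # Predicted reward difference, computed from the denoising trajectories.
    predicted_diff = beta * (ratio_a - ratio_b)

    # Supervised regression target: the actual reward difference of the pair.
    target_diff = reward_a - reward_b

    return F.mse_loss(predicted_diff, target_diff)
```

Regressing a trajectory-derived prediction onto the observed reward difference is what makes the objective supervised rather than policy-gradient based; clipping the log ratios is one simple way to keep updates "proximal", though the paper's actual algorithm may apply such constraints per denoising step.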
