Reinforcing Diffusion Models by Direct Group Preference Optimization

October 9, 2025
Authors: Yihong Luo, Tianyang Hu, Jing Tang
cs.AI

Abstract

While reinforcement learning methods such as Group Relative Preference Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this issue by using inefficient SDE-based samplers to induce stochasticity, but this reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which utilize relative information of samples within groups. This design eliminates the need for inefficient stochastic policies, unlocking the use of efficient deterministic ODE samplers and faster training. Extensive results show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics. Code is available at https://github.com/Luo-Yihong/DGPO.
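To make the "relative information of samples within groups" idea concrete, the sketch below is a minimal, hypothetical illustration of group-relative scoring; it is not code from the DGPO repository, and the function and variable names are assumptions. It normalizes reward-model scores within a group of samples generated for the same prompt, so each sample is judged only against its peers.

```python
# Hypothetical sketch: group-relative scoring of samples for one prompt.
# This is NOT the DGPO implementation; it only illustrates using relative
# information within a group, as described in the abstract.
import torch

def group_relative_scores(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Center and scale rewards within a single group so each sample
    is scored only relative to the other samples in the same group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: reward-model scores for a group of 4 images sampled
# (e.g., with a deterministic ODE sampler) from the same text prompt.
rewards = torch.tensor([0.71, 0.35, 0.88, 0.50])
scores = group_relative_scores(rewards)
print(scores)  # positive for above-average samples, negative for below-average
```

In a GRPO-style setup, such relative scores would weight per-sample policy-gradient updates; DGPO's group-level preference objective differs and is detailed in the paper and the linked code.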