V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
April 25, 2026
Authors: Bingda Tang, Yuhui Zhang, Xiaohan Wang, Jiayuan Mao, Ludwig Schmidt, Serena Yeung-Levy
cs.AI
Abstract
Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. While policy-gradient online reinforcement learning (RL) offers a principled post-training framework, its direct application is hindered by the intractable likelihoods of these models. Prior work therefore either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or uses likelihood surrogates based on the diffusion evidence lower bound (ELBO), which have so far underperformed on visual generation. Our key insight is that the ELBO-based approach can, in fact, be made both stable and efficient. By reducing surrogate variance and controlling gradient steps, we show that this approach can beat MDP-based methods. To this end, we introduce Variational GRPO (V-GRPO), a method that integrates ELBO-based surrogates with the Group Relative Policy Optimization (GRPO) algorithm, alongside a set of simple yet essential techniques. Our method is easy to implement, aligns with pretraining objectives, and avoids the limitations of MDP-based methods. V-GRPO achieves state-of-the-art performance in text-to-image synthesis, while delivering a 2× speedup over MixGRPO and a 3× speedup over DiffusionNFT.
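For intuition only, here is a minimal sketch of how an ELBO-based likelihood surrogate might be combined with GRPO-style group-relative advantages, as the abstract describes. Everything below is an illustrative assumption rather than the paper's method: the `elbo_surrogate` estimator, the `model.add_noise` / `model.denoiser` interface, `sampler`, `reward_fn`, and the use of gradient clipping as a stand-in for the paper's gradient-step control are hypothetical.

```python
import torch

def elbo_surrogate(model, images, num_mc=4):
    """Monte-Carlo estimate of the diffusion ELBO, used as a surrogate for
    log-likelihood (hypothetical helper; the paper's exact estimator and
    variance-reduction techniques are not specified here)."""
    total = 0.0
    for _ in range(num_mc):
        t = torch.rand(images.shape[0], device=images.device)          # random timesteps in [0, 1)
        noise = torch.randn_like(images)                               # Gaussian noise sample
        x_t = model.add_noise(images, noise, t)                        # forward (noising) process
        pred = model.denoiser(x_t, t)                                  # predicted noise
        total = total - ((pred - noise) ** 2).mean(dim=(1, 2, 3))      # negative denoising loss per sample
    return total / num_mc                                              # higher = more likely under the model

def v_grpo_step(model, optimizer, prompts, sampler, reward_fn,
                group_size=8, max_grad_norm=0.1):
    """One V-GRPO-style update: group-relative advantages weight the gradient
    of the ELBO surrogate. Gradient clipping here is only a stand-in for the
    paper's step-size control."""
    losses = []
    for prompt in prompts:
        with torch.no_grad():
            images = sampler(model, prompt, n=group_size)              # sample a group per prompt
            rewards = reward_fn(images, prompt)                        # preference or verifiable reward
            adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # group-relative advantage (GRPO)
        elbo = elbo_surrogate(model, images)                           # differentiable log-likelihood surrogate
        losses.append(-(adv * elbo).mean())                            # policy-gradient-style objective
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```

In this sketch, the two levers the abstract highlights map roughly to `num_mc` (more Monte-Carlo samples lower the surrogate's variance) and `max_grad_norm` (bounding the size of each gradient step); the actual techniques used in V-GRPO are detailed in the paper.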