

G^2RPO: Granular GRPO for Precise Reward in Flow Models

October 2, 2025
Authors: Yujie Zhou, Pengyang Ling, Jiazi Bu, Yibin Wang, Yuhang Zang, Jiaqi Wang, Li Niu, Guangtao Zhai
cs.AI

Abstract

The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via Stochastic Differential Equations (SDEs) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular GRPO (G^2RPO) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby yielding a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our G^2RPO significantly outperforms existing flow-based GRPO baselines, highlighting its effectiveness and robustness.
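
To make the two mechanisms named in the abstract concrete, the sketch below illustrates the general pattern they build on: a stochastic (SDE-style) denoising step that injects noise to create explorable sampling directions, GRPO-style group-normalized advantages, and an aggregation of advantages computed at several granularities. This is a minimal, hypothetical sketch, not the paper's implementation; the function names (`sde_step`, `grpo_advantages`, `multi_granularity_advantage`), the Euler–Maruyama-style update, and the uniform weighting across scales are all assumptions for illustration.

```python
import numpy as np


def sde_step(x, v, t, dt, noise_scale, rng):
    """One stochastic denoising step for a flow model (assumed
    Euler-Maruyama form): follow the learned velocity field v(x, t),
    then inject Gaussian noise so that different samples explore
    different denoising directions -- the SDE perturbation that the
    reward is meant to evaluate."""
    eps = rng.standard_normal(x.shape)
    return x + v(x, t) * dt + noise_scale * np.sqrt(dt) * eps


def grpo_advantages(rewards):
    """Group-relative advantages in the standard GRPO style:
    standardize each sample's reward against the group of samples
    it was drawn with."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)


def multi_granularity_advantage(rewards_per_scale, weights=None):
    """Aggregate advantages computed at several diffusion scales
    (granularities) into one signal, so the evaluation of a sampling
    direction is not biased toward a single fixed denoising
    granularity. `rewards_per_scale` holds one reward list per scale
    for the same group of samples; uniform weights are an assumption."""
    adv = np.stack([grpo_advantages(r) for r in rewards_per_scale])
    if weights is None:
        weights = np.full(len(rewards_per_scale), 1.0 / len(rewards_per_scale))
    # Weighted sum over scales -> one advantage per sample.
    return np.tensordot(np.asarray(weights), adv, axes=1)


if __name__ == "__main__":
    # Illustrative rewards for a group of 4 samples scored at 2 scales.
    rewards_per_scale = [
        [0.8, 0.4, 0.6, 0.9],  # rewards decoded at a coarse granularity
        [0.7, 0.5, 0.5, 1.0],  # rewards decoded at a finer granularity
    ]
    print(multi_granularity_advantage(rewards_per_scale))
```

Averaging standardized advantages across scales, rather than trusting one fixed granularity, is one simple way to realize the "more comprehensive and robust evaluation" the abstract describes; the paper's actual aggregation rule may differ.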