
G^2RPO: Granular GRPO for Precise Reward in Flow Models

October 2, 2025
Authors: Yujie Zhou, Pengyang Ling, Jiazi Bu, Yibin Wang, Yuhang Zang, Jiaqi Wang, Li Niu, Guangtao Zhai
cs.AI

Abstract

The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via Stochastic Differential Equations (SDE) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular-GRPO (G^2RPO) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby facilitating a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our G^2RPO significantly outperforms existing flow-based GRPO baselines, highlighting its effectiveness and robustness.
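To make the advantage-aggregation idea concrete, here is a minimal sketch of how group-relative advantages (the core of GRPO) could be computed per diffusion scale and then averaged across scales, loosely in the spirit of the Multi-Granularity Advantage Integration described above. The function names and the unweighted mean are illustrative assumptions, not the paper's actual implementation, which may use a different aggregation scheme.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each sample's
    reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

def multi_granularity_advantages(rewards_per_scale):
    """Illustrative aggregation: compute GRPO advantages at each
    diffusion scale (granularity), then average element-wise across
    scales to score each sampling direction more robustly."""
    per_scale = [grpo_advantages(r) for r in rewards_per_scale]
    return [mean(col) for col in zip(*per_scale)]
```

Averaging across scales means a sampling direction is rewarded only if it looks favorable at several denoising granularities, which is one plausible way to reduce the bias of scoring at a single fixed granularity.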
October 9, 2025