Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO
February 6, 2026
Authors: Yunze Tong, Mushui Liu, Canyu Zhao, Wanggui He, Shiyi Zhang, Hongwei Zhang, Peng Zhang, Jinlong Liu, Ju Huang, Jiamang Wang, Hao Jiang, Pipei Huang
cs.AI
Abstract
Deploying GRPO on Flow Matching models has proven effective for text-to-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, current group-wise ranking mainly compares trajectories at matched timesteps and ignores within-trajectory dependencies, where certain early denoising actions can affect later states via delayed, implicit interactions. We propose TurningPoint-GRPO (TP-GRPO), a GRPO framework that alleviates step-wise reward sparsity and explicitly models long-term effects within the denoising trajectory. TP-GRPO makes two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, providing a dense, step-aware learning signal that better isolates each denoising action's "pure" effect, and (ii) it identifies turning points, i.e., steps that flip the local reward trend and make the subsequent reward evolution consistent with the overall trajectory trend, and assigns these actions an aggregated long-term reward to capture their delayed impact. Turning points are detected solely via sign changes in incremental rewards, making TP-GRPO efficient and hyperparameter-free. Extensive experiments demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation quality. Demo code is available at https://github.com/YunzeTong/TurningPoint-GRPO.
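The abstract states that turning points are found purely from sign changes in the step-level incremental rewards. The sketch below illustrates that idea under stated assumptions; the function names, the exact alignment rule against the overall trajectory trend, and how intermediate-state rewards are obtained are hypothetical and are not the authors' released implementation (see the linked repository for the official code).

```python
import numpy as np

def incremental_rewards(step_rewards):
    """Per-step incremental reward: the change in reward produced by each denoising action.
    `step_rewards[t]` is assumed to be the reward of the decoded intermediate state after step t."""
    return np.diff(step_rewards)  # delta_t = R(x_{t+1}) - R(x_t)

def find_turning_points(step_rewards):
    """Illustrative sketch: mark steps whose incremental reward flips the local sign
    and agrees with the overall trajectory trend (final reward minus initial reward)."""
    deltas = incremental_rewards(step_rewards)
    overall_trend = np.sign(step_rewards[-1] - step_rewards[0])
    turning = []
    for t in range(1, len(deltas)):
        flips_local_trend = np.sign(deltas[t]) != np.sign(deltas[t - 1])
        aligns_with_trajectory = np.sign(deltas[t]) == overall_trend
        if flips_local_trend and aligns_with_trajectory:
            turning.append(t)
    return deltas, turning

# Example: rewards of decoded intermediate images along one denoising trajectory.
rewards = np.array([0.10, 0.08, 0.05, 0.12, 0.20, 0.28])
deltas, turning_points = find_turning_points(rewards)
print(deltas, turning_points)  # step 2 flips the downward trend and matches the overall rise
```

In this toy trajectory, the step from the third to the fourth state reverses a locally decreasing reward and the rewards keep rising afterwards, so only that step would receive the aggregated long-term reward; all other steps would keep their plain incremental rewards. Because the test uses only signs of the differences, no threshold or extra hyperparameter is needed, which matches the abstract's claim that the detection is efficient and hyperparameter-free.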