
Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO

February 6, 2026
Authors: Yunze Tong, Mushui Liu, Canyu Zhao, Wanggui He, Shiyi Zhang, Hongwei Zhang, Peng Zhang, Jinlong Liu, Ju Huang, Jiamang Wang, Hao Jiang, Pipei Huang
cs.AI

Abstract

Deploying GRPO on Flow Matching models has proven effective for text-to-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, current group-wise ranking mainly compares trajectories at matched timesteps and ignores within-trajectory dependencies, where certain early denoising actions can affect later states through delayed, implicit interactions. We propose TurningPoint-GRPO (TP-GRPO), a GRPO framework that alleviates step-wise reward sparsity and explicitly models long-term effects within the denoising trajectory. TP-GRPO makes two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, providing a dense, step-aware learning signal that better isolates each denoising action's "pure" effect; and (ii) it identifies turning points, i.e., steps that flip the local reward trend and bring subsequent reward evolution into line with the overall trajectory trend, and assigns these actions an aggregated long-term reward to capture their delayed impact. Turning points are detected solely via sign changes in incremental rewards, making TP-GRPO both efficient and hyperparameter-free. Extensive experiments demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation quality. Demo code is available at https://github.com/YunzeTong/TurningPoint-GRPO.