TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

August 6, 2025
Authors: Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, Bo Zhang
cs.AI

Abstract

Recent flow matching models for text-to-image generation have achieved remarkable quality, yet their integration with reinforcement learning for human preference alignment remains suboptimal, hindering fine-grained reward-based optimization. We observe that the key impediment to effective GRPO training of flow models is the temporal uniformity assumption in existing approaches: sparse terminal rewards with uniform credit assignment fail to capture the varying criticality of decisions across generation timesteps, resulting in inefficient exploration and suboptimal convergence. To remedy this shortcoming, we introduce TempFlow-GRPO (Temporal Flow GRPO), a principled GRPO framework that captures and exploits the temporal structure inherent in flow-based generation. TempFlow-GRPO introduces two key innovations: (i) a trajectory branching mechanism that provides process rewards by concentrating stochasticity at designated branching points, enabling precise credit assignment without requiring specialized intermediate reward models; and (ii) a noise-aware weighting scheme that modulates policy optimization according to the intrinsic exploration potential of each timestep, prioritizing learning during high-impact early stages while ensuring stable refinement in later phases. These innovations endow the model with temporally-aware optimization that respects the underlying generative dynamics, leading to state-of-the-art performance in human preference alignment and standard text-to-image benchmarks.
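
To make the two mechanisms concrete, here is a minimal PyTorch sketch of the ideas the abstract describes: trajectory branching that concentrates stochasticity at one designated point, and a noise-aware weighting of timesteps. Everything in it is an illustrative assumption, not the paper's implementation: `toy_velocity` stands in for a learned flow-matching velocity field, `reward_fn` for a terminal preference reward model, and `branch_step`, `noise_scale`, and `gamma` are invented hyperparameters.

```python
import torch


def toy_velocity(x: torch.Tensor, t: float) -> torch.Tensor:
    # Stand-in for the learned velocity field v_theta(x, t) of a flow model.
    return -x * (1.0 - t)


def reward_fn(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for a terminal reward (e.g., a human-preference score).
    return -x.pow(2).mean(dim=-1)


@torch.no_grad()
def branched_rollout(x0, n_steps=10, branch_step=3, n_branches=4, noise_scale=0.5):
    """Trajectory branching (sketch): integrate the ODE deterministically,
    except at one designated branching point where all stochasticity is
    concentrated. The spread of terminal rewards across the K branches then
    acts as a process reward for the decision taken at the branch point,
    with no intermediate reward model required."""
    ts = torch.linspace(0.0, 1.0, n_steps + 1)
    x = x0
    for i in range(branch_step):  # shared deterministic prefix
        x = x + toy_velocity(x, ts[i].item()) * (ts[i + 1] - ts[i])
    # Concentrate stochasticity here: spawn K noisy branches from one state.
    x = x.unsqueeze(0).repeat(n_branches, *([1] * x.dim()))
    x = x + noise_scale * torch.randn_like(x)
    for i in range(branch_step, n_steps):  # deterministic suffix per branch
        x = x + toy_velocity(x, ts[i].item()) * (ts[i + 1] - ts[i])
    rewards = reward_fn(x)                      # shape: (n_branches, batch)
    advantages = rewards - rewards.mean(dim=0)  # group-relative credit
    return advantages


def noise_aware_weights(sigmas: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Noise-aware weighting (hypothetical form): scale each timestep's
    policy-gradient term by its noise level, prioritizing high-noise early
    steps where exploration is most influential while damping updates in
    low-noise refinement steps."""
    w = sigmas.pow(gamma)
    return w / w.sum()


if __name__ == "__main__":
    adv = branched_rollout(torch.randn(8, 16))
    print(adv.shape)  # torch.Size([4, 8])
    print(noise_aware_weights(torch.linspace(1.0, 0.1, 10)))
```

In this toy setting, the branch-level advantages would weight the policy update at the branching timestep, and the per-timestep weights would rescale the GRPO loss across the schedule; the paper's actual formulation may differ in both respects.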