
TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

August 6, 2025
Authors: Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, Bo Zhang
cs.AI

Abstract

Recent flow matching models for text-to-image generation have achieved remarkable quality, yet their integration with reinforcement learning for human preference alignment remains suboptimal, hindering fine-grained reward-based optimization. We observe that the key impediment to effective GRPO training of flow models is the temporal uniformity assumption in existing approaches: sparse terminal rewards with uniform credit assignment fail to capture the varying criticality of decisions across generation timesteps, resulting in inefficient exploration and suboptimal convergence. To remedy this shortcoming, we introduce TempFlow-GRPO (Temporal Flow GRPO), a principled GRPO framework that captures and exploits the temporal structure inherent in flow-based generation. TempFlow-GRPO introduces two key innovations: (i) a trajectory branching mechanism that provides process rewards by concentrating stochasticity at designated branching points, enabling precise credit assignment without requiring specialized intermediate reward models; and (ii) a noise-aware weighting scheme that modulates policy optimization according to the intrinsic exploration potential of each timestep, prioritizing learning during high-impact early stages while ensuring stable refinement in later phases. These innovations endow the model with temporally-aware optimization that respects the underlying generative dynamics, leading to state-of-the-art performance in human preference alignment and standard text-to-image benchmarks.
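The two innovations can be sketched in a toy form. This is an illustrative sketch only, not the paper's implementation: the velocity field, reward function, σ schedule, and the names `branch_rollouts` and `noise_aware_weights` are all hypothetical stand-ins. Mechanism (i) runs a shared deterministic prefix, injects noise only at one designated branching point, finishes each branch deterministically, and converts the spread of terminal rewards into a GRPO-style group-relative advantage for that single decision; mechanism (ii) weights each timestep's policy-gradient term by its noise level so high-noise early steps dominate learning.

```python
import numpy as np

rng = np.random.default_rng(0)

def ode_step(x, t, dt):
    """Deterministic flow step (placeholder velocity field: drift to origin)."""
    v = -x
    return x + v * dt

def sde_step(x, t, dt, sigma):
    """Stochastic step, used only at the branching point."""
    x = ode_step(x, t, dt)
    return x + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)

def reward_fn(x):
    """Stand-in terminal reward (in the paper, a preference-model score)."""
    return -float(np.sum(x ** 2))

def branch_rollouts(x0, ts, branch_idx, n_branches=4, sigma=0.5):
    """(i) Trajectory branching: stochasticity is concentrated at one
    timestep, so reward differences credit that decision alone."""
    x, dt = x0, ts[1] - ts[0]
    for i in range(branch_idx):          # shared deterministic prefix
        x = ode_step(x, ts[i], dt)
    finals = []
    for _ in range(n_branches):          # K noisy branches from one state
        xb = sde_step(x, ts[branch_idx], dt, sigma)
        for i in range(branch_idx + 1, len(ts) - 1):
            xb = ode_step(xb, ts[i], dt)
        finals.append(xb)
    rewards = np.array([reward_fn(f) for f in finals])
    # GRPO-style group-relative advantage for the branching step
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def noise_aware_weights(sigmas):
    """(ii) Noise-aware weighting: scale each timestep's update by its
    noise level, prioritizing high-noise early steps."""
    w = np.asarray(sigmas, dtype=float)
    return w / w.sum()

ts = np.linspace(0.0, 1.0, 11)
adv = branch_rollouts(rng.standard_normal(4), ts, branch_idx=2)
w = noise_aware_weights(np.linspace(1.0, 0.1, 10))
print(adv.shape, w[0] > w[-1])  # early steps carry the larger weight
```

The advantages are computed per branching group rather than per full trajectory, which is what lets the scheme assign process-level credit without training an intermediate reward model.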
PDF · August 20, 2025