E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models
January 1, 2026
Authors: Shengjun Zhang, Zhang Zhang, Chensheng Dai, Yueqi Duan
cs.AI
Abstract
Recent reinforcement learning techniques have improved flow matching models for human preference alignment. While stochastic sampling enables exploration of denoising directions, existing methods that optimize over multiple denoising steps suffer from sparse and ambiguous reward signals. We observe that high-entropy steps enable more efficient and effective exploration, whereas low-entropy steps yield nearly indistinguishable roll-outs. To this end, we propose E-GRPO, an entropy-aware Group Relative Policy Optimization method that increases the entropy of SDE sampling steps. Since SDE integration suffers from ambiguous reward signals caused by stochasticity accumulated over multiple steps, we merge consecutive low-entropy steps into a single high-entropy SDE sampling step, while applying ODE sampling at the remaining steps. Building on this, we introduce a multi-step group-normalized advantage, which computes group-relative advantages among samples that share the same consolidated SDE denoising step. Experimental results under different reward settings demonstrate the effectiveness of our method.
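As a rough illustration of the multi-step group-normalized advantage described in the abstract, the sketch below computes GRPO-style relative advantages within groups of roll-outs that share the same consolidated SDE denoising step. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name group_normalized_advantage, the grouping by an explicit group_ids array, and the epsilon term for numerical stability are illustrative choices.

```python
import numpy as np

def group_normalized_advantage(rewards, group_ids, eps=1e-8):
    """Sketch of a GRPO-style group-normalized advantage.

    rewards   : scalar reward per roll-out.
    group_ids : id of the consolidated SDE denoising step each
                roll-out was sampled from (roll-outs with the same
                id form one group).
    Returns one advantage per roll-out:
        (reward - group mean) / (group std + eps).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    group_ids = np.asarray(group_ids)
    advantages = np.zeros_like(rewards)
    for g in np.unique(group_ids):
        mask = group_ids == g
        mu = rewards[mask].mean()
        sigma = rewards[mask].std()
        advantages[mask] = (rewards[mask] - mu) / (sigma + eps)
    return advantages

# Example: two groups of roll-outs, each sharing one merged SDE step.
rewards = [0.8, 0.5, 0.9, 0.2, 0.4, 0.3]
group_ids = [0, 0, 0, 1, 1, 1]
print(group_normalized_advantage(rewards, group_ids))
```

Normalizing within a group that shares the same consolidated SDE step keeps the comparison focused on the stochastic exploration at that step, since the remaining (ODE) steps are deterministic for all samples in the group.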