E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models
January 1, 2026
Authors: Shengjun Zhang, Zhang Zhang, Chensheng Dai, Yueqi Duan
cs.AI
Abstract
Recent reinforcement learning methods have enhanced flow matching models for human preference alignment. While stochastic sampling enables exploration of denoising directions, existing methods that optimize over multiple denoising steps suffer from sparse and ambiguous reward signals. We observe that high-entropy steps enable more efficient and effective exploration, while low-entropy steps yield roll-outs that are hard to distinguish. To this end, we propose E-GRPO, an entropy-aware Group Relative Policy Optimization method that increases the entropy of SDE sampling steps. Since SDE integration suffers from ambiguous reward signals when stochasticity is spread over multiple steps, we merge consecutive low-entropy steps into a single high-entropy step for SDE sampling, while applying ODE sampling to the remaining steps. Building on this, we introduce a multi-step group-normalized advantage, which computes group-relative advantages within samples sharing the same consolidated SDE denoising step. Experimental results under different reward settings demonstrate the effectiveness of our method.
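The abstract describes two concrete mechanisms, so a short sketch can make them easier to picture. The following is a minimal sketch under assumptions, not the authors' implementation: a toy velocity function stands in for the learned flow-matching model, and the per-step entropies, entropy threshold, noise scale, and reward are hypothetical placeholders. It merges the longest run of consecutive low-entropy steps into a single SDE step, applies deterministic ODE updates elsewhere, and computes group-normalized advantages over roll-outs that share that consolidated SDE step.

# Minimal sketch (not the authors' code) of the two ideas described above:
# (1) merge a run of consecutive low-entropy steps into one SDE step while
#     every other step takes a deterministic ODE (Euler) update, and
# (2) compute group-normalized advantages over roll-outs sharing the same
#     consolidated SDE step.
# The velocity field, per-step entropies, threshold, noise scale, and reward
# are illustrative placeholders, not the paper's models or settings.
import numpy as np


def velocity(x, t):
    """Toy stand-in for a learned flow-matching velocity field v_theta(x, t)."""
    return -x * (1.0 - t)


def longest_low_entropy_run(entropies, threshold):
    """Return (start, end) of the longest run with entropy < threshold (end exclusive)."""
    best, best_len, run_start = (0, 0), 0, None
    for i, h in enumerate(list(entropies) + [np.inf]):  # sentinel closes the final run
        if h < threshold and run_start is None:
            run_start = i
        elif h >= threshold and run_start is not None:
            if i - run_start > best_len:
                best, best_len = (run_start, i), i - run_start
            run_start = None
    return best


def rollout(x0, timesteps, sde_span, noise_scale, rng):
    """Deterministic ODE steps everywhere except one merged SDE step over sde_span."""
    x = np.array(x0, dtype=float)
    s, e = sde_span
    i = 0
    while i < len(timesteps) - 1:
        if i == s and e > s:
            # One consolidated SDE step spanning the merged low-entropy steps.
            dt = timesteps[e] - timesteps[s]
            x = x + velocity(x, timesteps[s]) * dt \
                + noise_scale * np.sqrt(dt) * rng.standard_normal(x.shape)
            i = e
        else:
            dt = timesteps[i + 1] - timesteps[i]
            x = x + velocity(x, timesteps[i]) * dt  # ODE (Euler) step, no noise
            i += 1
    return x


def group_normalized_advantages(rewards, eps=1e-8):
    """Relative advantages within a group sharing the same consolidated SDE step."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Usage with made-up numbers: the group shares x0, so the deterministic ODE
# prefix is identical and roll-outs diverge only at the merged SDE step.
rng = np.random.default_rng(0)
timesteps = np.linspace(0.0, 1.0, 11)            # 10 denoising steps
entropies = rng.uniform(0.0, 1.0, size=10)       # placeholder per-step entropies
span = longest_low_entropy_run(entropies, threshold=0.5)
x0 = rng.standard_normal(4)
group = [rollout(x0, timesteps, span, noise_scale=0.5, rng=rng) for _ in range(8)]
rewards = [-float(np.linalg.norm(x)) for x in group]   # toy reward
print(group_normalized_advantages(rewards))

In this sketch the group shares its initial noise, so the deterministic ODE prefix is identical across roll-outs and all exploration is concentrated at the single merged SDE step, which is the property the abstract ties to unambiguous reward credit.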