
DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment

January 28, 2026
Authors: Haoyou Deng, Keyu Yan, Chaojie Mao, Xiang Wang, Yu Liu, Changxin Gao, Nong Sang
cs.AI

Abstract

Recent GRPO-based approaches built on flow matching models have shown remarkable improvements in human preference alignment for text-to-image generation. Nevertheless, they still suffer from the sparse reward problem: the terminal reward of the entire denoising trajectory is applied uniformly to all intermediate steps, resulting in a mismatch between the global feedback signal and the fine-grained contribution of each intermediate denoising step. To address this issue, we introduce DenseGRPO, a novel framework that aligns human preference with dense rewards that evaluate the fine-grained contribution of each denoising step. Specifically, our approach includes two key components: (1) we propose to predict the step-wise reward gain as the dense reward of each denoising step, by applying a reward model to intermediate clean images estimated via an ODE-based approach; this aligns the feedback signal with the contribution of each individual step and facilitates effective training; and (2) based on the estimated dense rewards, we reveal a mismatch between the uniform exploration setting and the time-varying noise intensity in existing GRPO-based methods, which leads to an inappropriate exploration space. We therefore propose a reward-aware scheme that calibrates the exploration space by adaptively adjusting the timestep-specific stochasticity injected by the SDE sampler, ensuring a suitable exploration range at every timestep. Extensive experiments on multiple standard benchmarks demonstrate the effectiveness of the proposed DenseGRPO and highlight the critical role of valid dense rewards in flow matching model alignment.
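
To make the first component more concrete, the following is a minimal sketch of how step-wise reward gains could be computed from intermediate clean-image estimates. It is not the authors' implementation: the names (model, reward_model, prompt_emb, timesteps) and the flow-matching convention x_t = (1 - t) * x_0 + t * eps (so x0_hat = x_t - t * v) are assumptions for illustration, and the rollout here uses deterministic Euler ODE steps, whereas the paper's training rollout injects timestep-specific stochasticity through an SDE sampler.

    import torch

    @torch.no_grad()
    def densegrpo_step_rewards(model, reward_model, x_T, prompt_emb, timesteps):
        # Minimal sketch, not the paper's code. Assumes x_t = (1 - t) * x_0 + t * eps,
        # so a one-step ODE estimate of the clean image is x0_hat = x_t - t * v(x_t, t).
        # `timesteps` is a list of floats from 1.0 down to 0.0.
        def clean_estimate(x, t):
            v = model(x, t, prompt_emb)               # predicted velocity field at time t
            return x - t * v, v                       # ODE-based clean-image estimate

        x = x_T
        x0_hat, v = clean_estimate(x, timesteps[0])
        prev_score = reward_model(x0_hat, prompt_emb) # reward before any denoising step
        dense_rewards = []
        for t, t_next in zip(timesteps[:-1], timesteps[1:]):
            x = x - (t - t_next) * v                  # Euler ODE update from t to t_next
            x0_hat, v = clean_estimate(x, t_next)
            score = reward_model(x0_hat, prompt_emb)  # reward on the new intermediate clean image
            dense_rewards.append(score - prev_score)  # dense reward of this step = reward gain
            prev_score = score
        return dense_rewards                          # one step-wise gain per denoising step

In this sketch, the per-step gains sum (telescopically) to the terminal reward minus the initial score, so the dense rewards redistribute the trajectory-level feedback across the individual denoising steps rather than replacing it.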