

DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment

January 28, 2026
Authors: Haoyou Deng, Keyu Yan, Chaojie Mao, Xiang Wang, Yu Liu, Changxin Gao, Nong Sang
cs.AI

Abstract

Recent GRPO-based approaches built on flow matching models have shown remarkable improvements in human preference alignment for text-to-image generation. Nevertheless, they still suffer from the sparse reward problem: the terminal reward of the entire denoising trajectory is applied uniformly to all intermediate steps, creating a mismatch between the global feedback signal and the fine-grained contribution of each denoising step. To address this issue, we introduce DenseGRPO, a novel framework that aligns human preference with dense rewards that evaluate the fine-grained contribution of each denoising step. Specifically, our approach includes two key components: (1) we predict the step-wise reward gain as the dense reward of each denoising step by applying a reward model to intermediate clean images estimated via an ODE-based approach; this aligns the feedback signal with the contribution of each individual step and facilitates effective training; and (2) based on the estimated dense rewards, we reveal a mismatch between the uniform exploration setting of existing GRPO-based methods and the time-varying noise intensity, which leads to an inappropriate exploration space. We therefore propose a reward-aware scheme that calibrates the exploration space by adaptively adjusting the timestep-specific stochasticity injected by the SDE sampler, ensuring a suitable exploration space at every timestep. Extensive experiments on multiple standard benchmarks demonstrate the effectiveness of DenseGRPO and highlight the critical role of valid dense rewards in flow matching model alignment.
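To make the first component more concrete, below is a minimal sketch of how step-wise reward gains could be computed along a denoising trajectory. It assumes the rectified-flow convention x_t = (1 - t)·x0 + t·noise, under which a one-step ODE extrapolation of the clean image is x0_hat = x_t - t·v(x_t, t); `velocity_model` and `reward_model` are hypothetical stand-ins, and the exact ODE-based prediction used by DenseGRPO may differ.

```python
import torch

@torch.no_grad()
def dense_rewards(velocity_model, reward_model, x_T, prompt, timesteps):
    """Return the per-step reward gains along one denoising trajectory."""
    x_t = x_T
    prev_reward = None
    gains = []
    for i, t in enumerate(timesteps):        # timesteps descend from 1.0 toward 0.0
        v = velocity_model(x_t, t, prompt)   # predicted velocity at the current state
        x0_hat = x_t - t * v                 # one-step ODE extrapolation to a clean image
        r = reward_model(x0_hat, prompt)     # score the predicted clean image
        if prev_reward is not None:
            gains.append(r - prev_reward)    # dense reward = step-wise reward gain
        prev_reward = r
        # advance the trajectory with a plain Euler step (the actual sampler may be an SDE)
        t_next = timesteps[i + 1] if i + 1 < len(timesteps) else 0.0
        x_t = x_t + (t_next - t) * v
    return gains
```

Each gain credits a single denoising step for the improvement (or degradation) it causes in the predicted clean image, rather than smearing the terminal reward over the whole trajectory. For the second component, the following is only a toy illustration of reward-aware calibration of timestep-specific stochasticity: the per-timestep noise scale used by the SDE sampler is nudged so that the spread of dense rewards observed at that step across a group of samples stays near a target. The target value and the update rule are assumptions for illustration, not the paper's calibration formula.

```python
import torch

def calibrate_noise_scales(sigma, step_gains, target_std=0.05, lr=0.1):
    """
    sigma:      tensor [num_steps] of current per-timestep noise scales
    step_gains: tensor [group_size, num_steps] of dense rewards (reward gains)
                collected from a group of sampled trajectories
    """
    observed_std = step_gains.std(dim=0)  # reward spread per timestep
    # widen exploration where rewards barely vary, shrink it where they vary too much
    sigma = sigma * (1.0 + lr * (target_std - observed_std) / (target_std + 1e-8))
    return sigma.clamp(min=1e-4, max=1.0)
```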