Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
August 28, 2025
Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
cs.AI
Abstract
Recent advancements highlight the importance of GRPO-based reinforcement
learning methods and benchmarking in enhancing text-to-image (T2I) generation.
However, current methods using pointwise reward models (RM) for scoring
generated images are susceptible to reward hacking. We reveal that this happens
when minimal score differences between images are amplified after
normalization, creating illusory advantages that drive the model to
over-optimize for trivial gains, ultimately destabilizing the image generation
process. To address this, we propose Pref-GRPO, a pairwise preference
reward-based GRPO method that shifts the optimization objective from score
maximization to preference fitting, ensuring more stable training. In
Pref-GRPO, images are pairwise compared within each group using a preference RM,
and the win rate is used as the reward signal. Extensive experiments
demonstrate that Pref-GRPO differentiates subtle image quality differences,
providing more stable advantages and mitigating reward hacking. Additionally,
existing T2I benchmarks are limited by coarse evaluation criteria, hindering
comprehensive model assessment. To solve this, we introduce UniGenBench, a
unified T2I benchmark comprising 600 prompts across 5 main themes and 20
subthemes. It evaluates semantic consistency through 10 primary and 27
sub-criteria, leveraging MLLMs for benchmark construction and evaluation. Our
benchmark uncovers the strengths and weaknesses of both open- and closed-source
T2I models and validates the effectiveness of Pref-GRPO.
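The abstract's central contrast, pointwise score normalization versus pairwise win-rate rewards, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy scores, and the score-comparison stand-in for a preference RM are all illustrative assumptions.

```python
import numpy as np

def pointwise_advantages(scores, eps=1e-8):
    """GRPO-style group normalization of pointwise RM scores.

    Near-identical scores get stretched toward unit variance, so
    trivial gaps can yield large "illusory" advantages -- the
    instability the paper attributes to pointwise reward models.
    """
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / (s.std() + eps)

def winrate_rewards(n, prefer):
    """Pref-GRPO-style pairwise reward for a group of n images.

    `prefer(i, j)` stands in for a preference RM: True if image i is
    preferred over image j. Each image's reward is its win rate over
    the other n - 1 group members, so rewards stay bounded in [0, 1].
    """
    wins = [sum(prefer(i, j) for j in range(n) if j != i) for i in range(n)]
    return [w / (n - 1) for w in wins]

# Toy group: four images with almost identical pointwise scores.
scores = [0.812, 0.813, 0.811, 0.812]
print(pointwise_advantages(scores))  # 0.001-wide gaps blown up to ~+/-1.4
print(winrate_rewards(4, lambda i, j: scores[i] > scores[j]))
```

In this toy group the best and worst images differ by only 0.002 in raw score, yet normalization assigns them advantages of roughly +1.4 and -1.4, while the win-rate reward keeps them at 1.0 and 0.0 on a bounded scale derived from preferences rather than score magnitudes.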