Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
August 28, 2025
作者: Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
cs.AI
Abstract
Recent advancements highlight the importance of GRPO-based reinforcement
learning methods and benchmarking in enhancing text-to-image (T2I) generation.
However, current methods using pointwise reward models (RMs) to score
generated images are susceptible to reward hacking. We reveal that this happens
when minimal score differences between images are amplified after
normalization, creating illusory advantages that drive the model to
over-optimize for trivial gains, ultimately destabilizing the image generation
process. To address this, we propose Pref-GRPO, a pairwise preference
reward-based GRPO method that shifts the optimization objective from score
maximization to preference fitting, ensuring more stable training. In
Pref-GRPO, images are pairwise compared within each group using a preference RM,
and the win rate is used as the reward signal. Extensive experiments
demonstrate that Pref-GRPO differentiates subtle image quality differences,
providing more stable advantages and mitigating reward hacking. Additionally,
existing T2I benchmarks are limited by coarse evaluation criteria, hindering
comprehensive model assessment. To solve this, we introduce UniGenBench, a
unified T2I benchmark comprising 600 prompts across 5 main themes and 20
subthemes. It evaluates semantic consistency through 10 primary and 27
sub-criteria, leveraging MLLMs for benchmark construction and evaluation. Our
benchmark uncovers the strengths and weaknesses of both open- and closed-source
T2I models and validates the effectiveness of Pref-GRPO.
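
To make the failure mode and the proposed remedy concrete, the following is a minimal Python sketch, not the authors' implementation: the function names, the `eps` constant, and the example numbers are illustrative assumptions. It contrasts group-wise z-score normalization of near-identical pointwise RM scores, which inflates negligible gaps into large advantages, with a pairwise win-rate reward that stays bounded in [0, 1].

```python
import numpy as np

def grpo_advantages_pointwise(scores, eps=1e-8):
    """Group-normalized advantages from pointwise RM scores (illustrative).
    When scores are nearly identical, tiny gaps are blown up into large
    positive/negative advantages after normalization."""
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / (s.std() + eps)

def pref_grpo_winrate_rewards(pref_matrix):
    """Win-rate reward from pairwise comparisons within one group (illustrative).
    pref_matrix[i, j] = 1.0 if image i is preferred over image j by a
    pairwise preference RM, else 0.0. Reward = fraction of pairwise wins."""
    p = np.asarray(pref_matrix, dtype=float)
    n = p.shape[0]
    np.fill_diagonal(p, 0.0)                 # an image is not compared with itself
    return p.sum(axis=1) / (n - 1)

# Four images in a group with almost identical pointwise scores (made-up numbers).
scores = [0.812, 0.813, 0.811, 0.812]
print(grpo_advantages_pointwise(scores))     # ~[0., 1.41, -1.41, 0.]: spurious advantages

# Hypothetical pairwise preference judgments for the same group.
P = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
], dtype=float)
print(pref_grpo_winrate_rewards(P))          # [0.67, 0.33, 0., 1.]: bounded win rates
```

Because win rates are bounded and depend only on pairwise orderings from the preference RM, near-ties are no longer inflated into extreme reward gaps, which is the stability argument made in the abstract.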