Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Abstract

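Recent advances highlight the importance of GRPO-based reinforcement learning methods and of benchmarking for improving text-to-image (T2I) generation. However, current methods that score generated images with pointwise reward models (RMs) are vulnerable to reward hacking. We show that this problem arises when small score differences between images are exaggerated by normalization, producing illusory advantages that push the model to over-optimize for trivial gains and ultimately destabilize the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images within each group are compared pairwise using a preference RM, and the win rate serves as the reward signal. Extensive experiments demonstrate that Pref-GRPO distinguishes subtle differences in image quality, provides more stable advantages, and mitigates reward hacking. In addition, existing T2I benchmarks rely on coarse evaluation criteria, which hinders comprehensive model assessment. To address this, we introduce UniGenBench, a unified T2I benchmark of 600 prompts spanning 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary criteria and 27 sub-criteria and uses MLLMs for benchmark construction and evaluation. Our benchmark reveals the strengths and weaknesses of open-source and closed-source T2I models and validates the effectiveness of Pref-GRPO.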

English

Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods that use pointwise reward models (RMs) to score generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are compared pairwise within each group using a preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that Pref-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary criteria and 27 sub-criteria, leveraging MLLMs for benchmark construction and evaluation. Our benchmark uncovers the strengths and weaknesses of both open-source and closed-source T2I models and validates the effectiveness of Pref-GRPO.
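The two mechanisms named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: `group_normalize` assumes the usual GRPO-style standardization of rewards within a group, and `prefer` is a hypothetical stand-in for a pairwise preference RM (here a simple comparison of made-up quality values). The first function shows how near-identical pointwise scores turn into large advantages after normalization; the second shows a win-rate reward computed from pairwise comparisons within a group.

```python
from itertools import combinations
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps).

    Assumption: this mirrors the group normalization the abstract refers to.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def win_rates(images, prefer):
    """Fraction of pairwise comparisons each image wins within its group.

    `prefer(a, b)` is a hypothetical preference RM: it returns True when
    image `a` is preferred over image `b`.
    """
    n = len(images)
    wins = [0] * n
    for i, j in combinations(range(n), 2):
        if prefer(images[i], images[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Each image takes part in n - 1 comparisons.
    return [w / (n - 1) for w in wins]


if __name__ == "__main__":
    # Pointwise scores that differ only in the third decimal place.
    scores = [0.850, 0.851, 0.852, 0.853]
    # A score spread of 0.003 becomes advantages of roughly -1.34 to +1.34.
    print(group_normalize(scores))

    # Made-up quality values standing in for images; higher is preferred.
    images = [0.2, 0.9, 0.5, 0.7]
    rates = win_rates(images, lambda a, b: a > b)
    print(rates)  # win rates of 0, 1, 1/3 and 2/3
```

In this toy setup the win rate depends only on the ordering of the images, not on how far apart their scores are, which is one way to read the abstract's shift from score maximization to preference fitting.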

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
