Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Abstract

Recent advances highlight the importance of GRPO-based reinforcement learning methods and of benchmarking for improving text-to-image (T2I) generation. However, current methods that score generated images with pointwise reward models (RMs) are susceptible to reward hacking. We show that this happens when minimal score differences between images are amplified by normalization, creating illusory advantages that drive the model to over-optimize for trivial gains and ultimately destabilize the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, the images within each group are compared pairwise using a preference RM, and each image's win rate is used as its reward signal. Extensive experiments demonstrate that Pref-GRPO distinguishes subtle differences in image quality, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, which hinders comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary criteria and 27 sub-criteria, leveraging a multimodal large language model (MLLM) for benchmark construction and evaluation. Our benchmarks uncover the strengths and weaknesses of both open-source and closed-source T2I models and validate the effectiveness of Pref-GRPO.
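
The two mechanisms described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "normalization" means the usual within-group standardization (subtract the group mean, divide by the group standard deviation), and it replaces the preference RM with a hand-written preference matrix. The function names, the scores, and the matrix are hypothetical.

```python
import numpy as np


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (assumed GRPO-style normalization)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def win_rate_rewards(prefers):
    """Win rate per image from a pairwise preference matrix.

    prefers[i][j] is True when image i is preferred over image j.
    Diagonal entries are ignored. In a real setup these entries would
    come from a preference RM; here they are supplied by hand.
    """
    p = np.asarray(prefers, dtype=bool)
    n = p.shape[0]
    off_diagonal = ~np.eye(n, dtype=bool)
    return (p & off_diagonal).sum(axis=1) / (n - 1)


# Hypothetical pointwise scores that differ only in the third decimal place.
pointwise_scores = [0.901, 0.902, 0.903, 0.904]
pointwise_adv = group_advantages(pointwise_scores)

# Hypothetical pairwise outcomes for a group of four images:
# image 3 beats everyone, image 2 beats images 0 and 1, image 1 beats image 0.
prefers = [
    [False, False, False, False],
    [True, False, False, False],
    [True, True, False, False],
    [True, True, True, False],
]
win_rates = win_rate_rewards(prefers)

print("pointwise advantages:", pointwise_adv)
print("win-rate rewards:", win_rates)
```

In this toy group the raw scores span only 0.003, yet the standardized advantages exceed 1 in magnitude, which is the amplification effect the abstract describes. The win-rate rewards depend only on the outcomes of the pairwise comparisons, not on raw score magnitudes, and would then be fed to the same group normalization.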
