RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

Abstract

Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
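The test-time Generate-Critique-Refine loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Critique` fields, the dimension names, the mean-score aggregation, the stopping threshold, and the `toy_generate` / `toy_critique` stand-ins are all hypothetical, since the abstract does not specify them. The only property taken from the source is that critiques are turned into prompt revisions and no generator parameters are updated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Critique:
    """Hypothetical container for a reasoning reward model's output."""
    scores: Dict[str, float]   # per-dimension scores in [0, 1] (assumed scale)
    rationale: str             # free-text, multi-dimensional critique
    revised_prompt: str        # prompt revision suggested by the critique


def generate_critique_refine(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Critique],
    max_rounds: int = 3,
    threshold: float = 0.9,
) -> Tuple[str, float]:
    """Refine the prompt, never the generator, and return the best output seen."""
    best_output, best_score = "", float("-inf")
    current_prompt = prompt
    for _ in range(max_rounds):
        output = generate(current_prompt)
        # The output is always judged against the user's original request.
        result = critique(prompt, output)
        # Assumption: aggregate the dimensions with a plain mean.
        score = sum(result.scores.values()) / len(result.scores)
        if score > best_score:
            best_output, best_score = output, score
        if score >= threshold:
            break
        current_prompt = result.revised_prompt
    return best_output, best_score


# Toy stand-ins so the sketch runs end to end; a real setup would call an
# image generator and a reward model here.
def toy_generate(prompt: str) -> str:
    return f"image<{prompt}>"


def toy_critique(original_prompt: str, output: str) -> Critique:
    has_color = "red" in output
    return Critique(
        scores={
            "text_faithfulness": 1.0 if has_color else 0.2,
            "image_quality": 1.0,
        },
        rationale="color specified" if has_color else "color is unspecified",
        revised_prompt=f"{original_prompt}, red",
    )
```

Calling `generate_critique_refine("a car", toy_generate, toy_critique)` runs two rounds with these stand-ins: the first output scores below the threshold, the suggested revision is fed back to the generator, and the second output is accepted.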
