RationalRewards: Reasoning Rewards Scale Visual Generation at Both Training and Test Time
April 13, 2026
Authors: Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, Wenhu Chen
cs.AI
Abstract
Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
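The test-time Generate-Critique-Refine loop can be made concrete with a short sketch. The abstract does not specify an API, so everything below is an illustrative assumption: `generator` and `reward_model` are stand-in callables, the reward model is assumed to return a scalar score plus a list of per-dimension issues, and the refine step simply folds the critique back into the prompt (in practice this revision would itself be produced by a language model). No generator parameters are updated at any point.

```python
# Minimal sketch of the Generate-Critique-Refine loop described in the
# abstract. All function names, signatures, and the stopping rule are
# hypothetical; only the loop structure follows the paper's description.

from dataclasses import dataclass


@dataclass
class Critique:
    score: float       # overall preference score from the reward model
    issues: list[str]  # multi-dimensional critique, e.g. per-axis failures


def generate_critique_refine(prompt, generator, reward_model,
                             max_rounds=3, target_score=0.9):
    """Iteratively refine the prompt using the reward model's critiques.

    The generator's weights are never updated; only the prompt changes.
    """
    best_image, best_score = None, float("-inf")
    for _ in range(max_rounds):
        image = generator(prompt)                       # Generate
        crit: Critique = reward_model(prompt, image)    # Critique
        if crit.score > best_score:
            best_image, best_score = image, crit.score
        if crit.score >= target_score or not crit.issues:
            break
        # Refine: turn the critique into a targeted prompt revision.
        # Appending the issues as explicit constraints is a simple
        # illustrative stand-in for an LLM-written revision.
        prompt = prompt + " | fix: " + "; ".join(crit.issues)
    return best_image, best_score
```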
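Similarly, the PARROT data pipeline (anchored generation, consistency filtering, distillation) can be sketched as a filter over preference pairs. The `rationalizer` and `judge` callables and the exact consistency check below are assumptions for illustration; the paper's actual procedure may differ in detail.

```python
# Minimal sketch of the Preference-Anchored Rationalization (PARROT)
# pipeline as described in the abstract. rationalizer and judge are
# assumed LLM calls; the training loop for distillation is omitted.

def parrot_pipeline(preference_data, rationalizer, judge):
    """preference_data: iterable of (prompt, image_a, image_b, winner)
    tuples, where winner is 'a' or 'b'. Returns the filtered set of
    rationales used to distill the reward model."""
    distillation_set = []
    for prompt, img_a, img_b, winner in preference_data:
        # Anchored generation: the rationalizer is shown the ground-truth
        # winner, so it explains a known-correct preference rather than
        # guessing one.
        rationale = rationalizer(prompt, img_a, img_b, anchor=winner)
        # Consistency filtering: an independent judge reads the rationale
        # and must recover the same winner; inconsistent rationales are
        # discarded as low quality.
        predicted = judge(prompt, img_a, img_b, rationale)
        if predicted == winner:
            distillation_set.append(
                (prompt, img_a, img_b, rationale, winner))
    # Distillation: the surviving rationales supervise fine-tuning of the
    # 8B reward model (training step not shown).
    return distillation_set
```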