RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time
April 13, 2026
Authors: Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, Wenhu Chen
cs.AI
Abstract
Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
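Below is a minimal sketch of the test-time Generate-Critique-Refine loop described in the abstract. It assumes hypothetical interfaces (`generator`, `reward_model.critique`, `reward_model.revise_prompt`) standing in for the actual models; these names are illustrative and not the authors' released API.

```python
def generate_critique_refine(prompt: str, generator, reward_model,
                             max_rounds: int = 3, accept_score: float = 0.9):
    """Iteratively refine a prompt using the reward model's critiques.

    Assumed (hypothetical) interfaces:
      generator(prompt) -> image
      reward_model.critique(prompt, image) -> (rationale: str, score: float)
      reward_model.revise_prompt(prompt, rationale) -> revised prompt

    No generator parameters are updated; only the prompt is revised.
    """
    best_image, best_score = None, float("-inf")
    for _ in range(max_rounds):
        image = generator(prompt)                                 # Generate
        rationale, score = reward_model.critique(prompt, image)   # Critique
        if score > best_score:
            best_image, best_score = image, score
        if score >= accept_score:                                 # good enough, stop early
            break
        prompt = reward_model.revise_prompt(prompt, rationale)    # Refine
    return best_image, best_score
```

The design choice here mirrors the abstract's claim: because the critique is a structured rationale rather than a bare score, it can be mapped directly into a targeted prompt revision, which is why the loop can improve outputs without any fine-tuning of the generator.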