Reward Hacking in Rubric-Based Reinforcement Learning
May 12, 2026
Authors: Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang, Anisha Gunjal, Bing Liu, Yunzhong He
cs.AI
Abstract
Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, but many open-ended settings still rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce the self-internalization gap, a verifier-free diagnostic based on policy log-probabilities that tracks reference-verifier quality and detects when a policy trained with a weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking but does not by itself ensure that rubric gains correspond to broader quality gains.
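The abstract describes two quantities only at a high level: the gap between proxy (training-verifier) and reference-panel reward, and the verifier-free self-internalization gap computed from policy log-probabilities. The sketch below is a minimal illustration of how such diagnostics could be computed; it is not the authors' implementation. The interfaces `train_verifier`, `reference_verifiers`, and `logprob_fn`, the preferred/dispreferred response pairing, the length normalization, and the exact definition of the self-internalization gap are all assumptions made for illustration.

```python
from typing import Callable, List

# Hypothetical interfaces; the paper does not specify these signatures.
#   verifier(prompt, response) -> float in [0, 1]   (rubric-based score)
#   logprob_fn(prompt, response) -> float            (sum of policy token log-probs)


def verifier_divergence(prompts: List[str],
                        responses: List[str],
                        train_verifier: Callable[[str, str], float],
                        reference_verifiers: List[Callable[[str, str], float]]) -> float:
    """Mean gap between the training verifier's score and the average score
    of a reference-verifier panel. A large positive value indicates
    proxy-reward gains that do not transfer, i.e. verifier failure."""
    gaps = []
    for p, r in zip(prompts, responses):
        proxy = train_verifier(p, r)
        reference = sum(v(p, r) for v in reference_verifiers) / len(reference_verifiers)
        gaps.append(proxy - reference)
    return sum(gaps) / len(gaps)


def self_internalization_gap(prompts: List[str],
                             preferred: List[str],
                             dispreferred: List[str],
                             logprob_fn: Callable[[str, str], float]) -> float:
    """One plausible reading of a verifier-free diagnostic: the mean
    difference in length-normalized policy log-probability between
    responses a reference panel prefers and responses it rejects. If the
    policy has internalized the reference notion of quality, this gap
    should grow during training; a plateau would signal that the policy
    has stopped improving even while proxy reward keeps rising."""
    diffs = []
    for p, good, bad in zip(prompts, preferred, dispreferred):
        lp_good = logprob_fn(p, good) / max(len(good.split()), 1)
        lp_bad = logprob_fn(p, bad) / max(len(bad.split()), 1)
        diffs.append(lp_good - lp_bad)
    return sum(diffs) / len(diffs)
```

The length normalization here is one possible design choice, intended to keep the diagnostic from simply rewarding longer responses; whether the paper normalizes this way is not stated in the abstract.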