

Reward Hacking in Rubric-Based Reinforcement Learning

May 12, 2026
Authors: Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang, Anisha Gunjal, Bing Liu, Yunzhong He
cs.AI

Abstract

Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, yet many open-ended settings still rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated by a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce the self-internalization gap, a verifier-free diagnostic based on policy log-probabilities that tracks reference-verifier quality and detects when a policy trained with a weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking but does not by itself ensure that rubric gains correspond to broader quality gains.
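
To make the training signal concrete, here is a minimal sketch of a rubric-based reward of the kind a policy could be optimized against. Everything below is an illustrative assumption rather than the paper's implementation: `Criterion`, `rubric_reward`, and the pass/fail `verifier` callable are hypothetical names, and the scoring rule (weighted fraction of credited criteria) is just one common convention.

```python
# Hypothetical sketch of a rubric-based reward; not the paper's code.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    description: str   # e.g. "Explicitly lists the drug's contraindications"
    weight: float = 1.0

def rubric_reward(
    response: str,
    rubric: List[Criterion],
    verifier: Callable[[str, str], bool],
) -> float:
    """Score a response as the weighted fraction of rubric criteria credited.

    `verifier(response, criterion_description)` is assumed to be an LLM call
    that returns True iff the verifier judges the criterion satisfied.
    """
    total = sum(c.weight for c in rubric)
    credited = sum(
        c.weight for c in rubric if verifier(response, c.description)
    )
    return credited / total if total > 0 else 0.0
```

Under a convention like this, the recurring verifier failures the abstract names, such as crediting a compound criterion when only one of its clauses is satisfied, or treating implicit content as explicit, directly inflate the proxy reward without improving the judged quality of the response.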
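The self-internalization gap is described only as a verifier-free diagnostic built from policy log-probabilities, so the sketch below assumes one plausible instantiation: contrasting the policy's mean per-token log-probability on its own sampled response with that on a fixed reference response. The function names, the sign convention, and the boundary handling are all assumptions; the exact definition should be taken from the paper.

```python
# Hypothetical sketch of a log-probability diagnostic; the paper's exact
# "self-internalization gap" may be defined differently.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_token_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Mean log-probability the model assigns to `response` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position t predicts token t+1, so shift logits and targets by one.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Response tokens start at index `prompt_len` in full_ids, i.e. at index
    # `prompt_len - 1` after the shift. (Boundary retokenization effects are
    # ignored in this sketch.)
    return token_lp[:, prompt_len - 1:].mean().item()

def self_internalization_gap(model, tokenizer, prompt: str,
                             own_response: str, ref_response: str) -> float:
    """Assumed form of the diagnostic: the policy's log-prob margin for its
    own response over a reference response, computed with no verifier call."""
    own = mean_token_logprob(model, tokenizer, prompt, own_response)
    ref = mean_token_logprob(model, tokenizer, prompt, ref_response)
    return own - ref

# Usage sketch (model name is a placeholder):
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
# gap = self_internalization_gap(model, tok, prompt, sampled, reference)
```

Tracked across checkpoints, a margin of this kind that plateaus while the proxy reward keeps climbing would match the abstract's use of the diagnostic: flagging, without any verifier, that a policy trained against a weak verifier has stopped genuinely improving.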