Reward Hacking in Rubric-Based Reinforcement Learning

Abstract

Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, but many open-ended settings rely instead on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, which reduces dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that the reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers. Exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce a self-internalization gap, a verifier-free diagnostic based on policy log-probabilities that tracks reference-verifier quality and detects when a policy trained with the weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking but does not by itself ensure that rubric gains correspond to broader quality gains.
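
As a rough illustration of the verifier-failure notion described in the abstract, the following Python sketch computes the fraction of rubric criteria that a training verifier credits but a reference panel rejects. The majority-vote aggregation over the judges, the function names `panel_accepts` and `exploitation_rate`, and the boolean data layout are assumptions made for illustration; the abstract does not specify how panel verdicts are aggregated or how exploitation is measured.

```python
from typing import Sequence


def panel_accepts(votes: Sequence[bool]) -> bool:
    """Majority vote over reference judges (an assumed aggregation rule)."""
    return sum(votes) * 2 > len(votes)


def exploitation_rate(
    training_verdicts: Sequence[bool],
    panel_verdicts: Sequence[Sequence[bool]],
) -> float:
    """Fraction of criteria credited by the training verifier that the panel rejects.

    training_verdicts[i] is the training verifier's verdict on criterion i;
    panel_verdicts[i] holds the reference judges' verdicts on the same criterion.
    Returns 0.0 when the training verifier credits nothing.
    """
    if len(training_verdicts) != len(panel_verdicts):
        raise ValueError("verdict lists must have the same length")
    # Keep only the panel votes for criteria the training verifier credited.
    credited = [votes for ok, votes in zip(training_verdicts, panel_verdicts) if ok]
    if not credited:
        return 0.0
    rejected = sum(1 for votes in credited if not panel_accepts(votes))
    return rejected / len(credited)


if __name__ == "__main__":
    # Hypothetical verdicts for four rubric criteria and a three-judge panel.
    training = [True, True, False, True]
    panel = [
        [True, True, False],
        [False, False, True],
        [True, True, True],
        [False, False, False],
    ]
    print(exploitation_rate(training, panel))
```

Under these assumptions, a rate that rises across training checkpoints would correspond to the growing exploitation of a weak verifier that the abstract reports.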
