ルーブリックベースの強化学習における報酬ハッキングの再現、分析、および検出

要旨

ルーブリックベースの強化学習（Reinforcement Learning, RL）では、LLM-as-a-Judge（LaaJ）を用いて、ルーブリックに従ってモデルの出力をスコアリングし、これを報酬として利用する。しかし、ポリシーモデルが評価者の潜在的なバイアスを悪用することで、報酬ハッキングが発生し、効果的でない、あるいは安全でない学習結果を招く可能性がある。現実のルーブリックベースRLにおいて、このようなハッキング動作はしばしば微妙であり、複数の評価者のバイアスが絡み合っているため、分析、検出、軽減が困難である。本稿では、ルーブリックベースRL向けの制御可能なハッキング環境であるCHERRLを紹介する。既知のバイアスをLaaJに注入することで、CHERRLは報酬ハッキングの安定的な再現、報酬の乖離の明示的な観察、そしてハッキング発生時点の正確な特定を可能にする。これにより、ルーブリックベースRLにおける報酬ハッキングのメカニズムとその軽減策を研究するための、クリーンな実験用テストベッドが提供される。その有用性を示すため、発見可能性と悪用可能性の観点から異なる評価者のバイアスを分析し、学習ログから報酬ハッキングの発生を自動的に検出するエージェントベースのシステムを探求する。コードと環境は https://github.com/THUAIS-Lab/CHERRL で公開されている。

English

Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS-Lab/CHERRL.