在基於評分量表的強化學習中重現、分析與檢測獎勵駭客

摘要

基于评分的强化学习（Rubric-based RL）利用大语言模型作为裁判（LLM-as-a-Judge, LaaJ），根据评分标准对模型输出进行打分作为奖励。然而，策略模型可能会利用裁判中的潜在偏差，导致奖励破解（reward hacking），从而产生无效或危险的训练结果。在实际的基于评分的强化学习中，这类破解行为往往表现微妙，并与多种裁判偏差交织在一起，使得分析、检测和缓解都变得困难。本文提出CHERRL——一个针对基于评分的强化学习的可控破解环境。通过向LaaJ注入已知偏差，CHERRL能够稳定复现奖励破解、清晰观察奖励发散，并精确识别破解的起始时间点。这为研究基于评分的强化学习中奖励破解的机制与缓解方法提供了一个干净的实验测试平台。为展示其用途，我们从可发现性和可利用性的角度分析了不同的裁判偏差，并探索了一种基于代理的系统，用于从训练日志中自动检测奖励破解的起始时间。代码与环境已在 https://github.com/THUAIS-Lab/CHERRL 公开。

English

Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS-Lab/CHERRL.