复现、分析与检测基于量规的强化学习中的奖励黑客行为

摘要

基于评分标准的强化学习（Rubric-based RL）采用大语言模型作为评判者（LLM-as-a-Judge, LaaJ）依据评分标准对模型输出进行打分，以此作为奖励信号。然而，策略模型可能利用评判者中存在的潜在偏见，导致奖励破解（reward hacking），产生无效甚至不安全的训练结果。在实际的基于评分标准的强化学习中，此类破解行为往往表现微妙，且与多种评判者偏见相互纠缠，使得分析、检测和缓解变得困难。本文提出CHERRL——一种用于基于评分标准强化学习的环境可控破解系统。通过向LaaJ注入已知偏见，CHERRL能够稳定复现奖励破解现象，明确观察奖励发散过程，并精确识别破解行为的触发时刻。这为研究基于评分标准强化学习中奖励破解的机制与缓解策略提供了清晰的实验平台。为展示其实用性，我们从可发现性与可利用性两个角度分析了不同评判者偏见，并探索了一种基于智能体的自动检测系统，用于从训练日志中识别奖励破解的触发点。相关代码与环境已在 https://github.com/THUAIS-Lab/CHERRL 公开。

English

Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS-Lab/CHERRL.