Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

Abstract

Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs against rubrics and uses those scores as rewards. However, the policy model may exploit latent biases in the judge, which leads to reward hacking and to ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into the LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS-Lab/CHERRL.
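To make the ideas of bias injection, reward divergence, and hacking onset concrete, the following is a minimal toy sketch in Python. It is not CHERRL's implementation: the functions `biased_score` and `detect_onset`, the fixed bonus, and the threshold and patience values are all hypothetical and chosen only for illustration. The sketch assumes that a bias-free reference reward is available at every training step, which is plausible only in a controlled setting where the injected bias is known.

```python
from typing import Optional, Sequence


def biased_score(true_quality: float, has_trigger: bool, bonus: float = 0.5) -> float:
    """Toy judge (hypothetical): rubric score plus a fixed bonus
    whenever the injected bias trigger appears in the output."""
    return min(1.0, true_quality + (bonus if has_trigger else 0.0))


def detect_onset(
    judge_rewards: Sequence[float],
    reference_rewards: Sequence[float],
    threshold: float = 0.2,
    patience: int = 3,
) -> Optional[int]:
    """Return the first step of a run of `patience` consecutive steps in which
    the judge reward exceeds the reference reward by more than `threshold`.
    Returns None if no such run exists."""
    if len(judge_rewards) != len(reference_rewards):
        raise ValueError("reward sequences must have the same length")
    run = 0
    for step, (judge, reference) in enumerate(zip(judge_rewards, reference_rewards)):
        if judge - reference > threshold:
            run += 1
            if run == patience:
                return step - patience + 1  # start of the divergent run
        else:
            run = 0
    return None


# Illustrative trajectory: the policy starts emitting the trigger at step 3,
# after which true quality declines while the judge reward stays high.
reference = [0.40, 0.45, 0.50, 0.50, 0.45, 0.40, 0.35]
trigger = [False, False, False, True, True, True, True]
judge = [biased_score(q, t) for q, t in zip(reference, trigger)]

print(detect_onset(judge, reference))  # 3
```

Requiring several consecutive divergent steps, rather than a single one, is a simple guard against flagging noise in the reward signal; the numbers in the trajectory are made up and carry no empirical meaning.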