Reward Hacking in Rubric-Based Reinforcement Learning

Abstract

Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce a self-internalization gap, a verifier-free diagnostic based on policy log-probabilities, which tracks reference-verifier quality and detects when a policy trained with the weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking, but does not by itself ensure that rubric gains correspond to broader quality gains.
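The notion of verifier failure above (the training verifier credits a rubric criterion that the reference verifiers reject) can be made concrete with a small sketch. This is only an illustration under stated assumptions: the abstract does not say how the three-judge panel is aggregated, so majority voting is assumed here, and the names `panel_accepts` and `exploitation_rate` are hypothetical rather than taken from the paper.

```python
from typing import Sequence


def panel_accepts(votes: Sequence[bool]) -> bool:
    """Return True if a strict majority of reference judges accept a criterion.

    Majority voting is an assumption; the paper's aggregation rule may differ.
    """
    return sum(votes) * 2 > len(votes)


def exploitation_rate(
    train_credit: Sequence[bool],
    panel_votes: Sequence[Sequence[bool]],
) -> float:
    """Fraction of criteria credited by the training verifier that the panel rejects.

    train_credit[i] is the training verifier's decision on criterion i;
    panel_votes[i] holds the reference judges' decisions on the same criterion.
    """
    if len(train_credit) != len(panel_votes):
        raise ValueError("one set of panel votes is required per criterion")
    # Keep only criteria that the training verifier credited.
    credited = [votes for credit, votes in zip(train_credit, panel_votes) if credit]
    if not credited:
        return 0.0
    rejected = sum(1 for votes in credited if not panel_accepts(votes))
    return rejected / len(credited)


# Hypothetical data: four rubric criteria, three reference judges each.
train_credit = [True, True, False, True]
panel_votes = [
    (True, True, False),    # credited, panel accepts
    (False, False, True),   # credited, panel rejects (counts as exploitation)
    (False, False, False),  # not credited, ignored
    (True, True, True),     # credited, panel accepts
]
rate = exploitation_rate(train_credit, panel_votes)
```

Computed per checkpoint, a quantity of this kind would rise over training if exploitation grows, which is the pattern the abstract reports for weak verifiers.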

