Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Abstract

Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through the dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introduce trusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delays shortcut exploitation and better preserves task performance.
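
The idea of constraining a gradient to a reference subspace can be illustrated with a minimal sketch. The abstract does not specify how the clean reference subspace is constructed or how directional change is measured, so everything below is an assumption made for illustration: the reference directions are taken as given (for example, they might be top singular vectors of updates from a clean run), the projection is a plain orthogonal projection, and directional change is measured as one minus the absolute cosine similarity. The function names are hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`.

    Assumption: `vectors` are the "trusted" reference directions; how they
    are obtained is outside the scope of this sketch.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_onto_subspace(grad, basis):
    """Keep only the component of `grad` that lies in span(basis)."""
    out = [0.0] * len(grad)
    for b in basis:
        c = dot(grad, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out


def directional_change(u, v):
    """1 - |cos(u, v)|: 0 for parallel directions, 1 for orthogonal ones.

    Assumption: this is one simple way to quantify directional change,
    not necessarily the measure used by the authors.
    """
    denom = math.sqrt(dot(u, u) * dot(v, v))
    return 1.0 - abs(dot(u, v)) / denom


# Toy example in 3 dimensions: the reference subspace is the x-y plane.
reference_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(reference_directions)

raw_gradient = [0.5, -2.0, 3.0]
projected = project_onto_subspace(raw_gradient, basis)

# The component along the z-axis, which lies outside the subspace, is removed.
print("projected gradient:", projected)
print("directional change raw vs projected:",
      directional_change(raw_gradient, projected))
```

In this toy setting the projected gradient would replace the raw gradient in the optimizer step, so the update cannot move along directions absent from the reference subspace. For real language models the vectors would be flattened parameter updates or per-layer matrices, and a numerical library would replace the pure-Python loops.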