Behind RLHF: On-Policy Feedback for Self-Supervised Reward Model Improvement

Abstract

Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse, reliable preference data from human annotators or judge models. The problem becomes markedly worse as the policy evolves beyond the data on which a static RM was trained. We therefore propose SAVE (Self-supervised reward model improvement via Value-Anchored On-policy feedback), a framework for on-policy RM training that uses a value function to grade on-policy responses as feedback. SAVE converts the reward-graded on-policy responses into supervision by using a prompt-specific value head as an adaptive anchor. It then computes RM advantages, filters out ambiguous samples, and updates the RM via a contrastive objective. Empirical evaluation on six diverse benchmarks supports the effectiveness of SAVE for RM training. It achieves superior results on all datasets and maintains consistent improvements across three RL algorithms (GRPO, RLOO, GSPO) and different policy backbones.
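The abstract names three steps (RM advantages measured against an anchor, filtering of ambiguous samples, a contrastive update) without giving formulas. The following is a minimal sketch of one way these steps could fit together for a single prompt, not the authors' implementation. Everything specific in it is an assumption made for illustration: the advantage is taken as reward minus anchor, ambiguity is defined by a hypothetical `margin` threshold, the contrastive objective is a pairwise logistic (Bradley-Terry style) loss, and all function and parameter names are invented here.

```python
import math
from itertools import product


def rm_advantages(rewards, anchor):
    """Assumed form of the RM advantage: reward minus the prompt's anchor value."""
    return [r - anchor for r in rewards]


def split_by_anchor(rewards, anchor, margin):
    """Split response indices into positives and negatives relative to the anchor.

    Responses whose advantage magnitude is below `margin` (a hypothetical
    threshold) are treated as ambiguous and dropped.
    """
    positives, negatives = [], []
    for index, advantage in enumerate(rm_advantages(rewards, anchor)):
        if advantage >= margin:
            positives.append(index)
        elif advantage <= -margin:
            negatives.append(index)
    return positives, negatives


def contrastive_loss(rewards, positives, negatives):
    """Mean pairwise logistic loss, -log(sigmoid(r_pos - r_neg)), over all pairs.

    This is a stand-in for the contrastive objective, which the abstract
    does not specify. Returns 0.0 when no (positive, negative) pair exists.
    """
    pairs = list(product(positives, negatives))
    if not pairs:
        return 0.0
    total = sum(
        math.log1p(math.exp(-(rewards[i] - rewards[j]))) for i, j in pairs
    )
    return total / len(pairs)


# Illustrative numbers only: RM scores for four sampled responses to one
# prompt, and the value head's prediction for that prompt used as the anchor.
rewards = [1.2, 0.1, -0.8, 0.05]
anchor = 0.0
positives, negatives = split_by_anchor(rewards, anchor, margin=0.2)
loss = contrastive_loss(rewards, positives, negatives)
```

In this toy case the two responses scored close to the anchor are dropped, and the loss is computed on the single remaining positive-negative pair. In an actual training loop the rewards would be differentiable RM outputs and the loss would be backpropagated into the RM. The sketch only shows the data flow on plain floats.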