The Other Side of RLHF: On-Policy Feedback for Self-Supervised Reward Model Improvement

Abstract

Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse, reliable preference data from human annotators or judge models. The problem grows sharply worse as the policy evolves beyond the data on which the static RM was trained. We therefore propose SAVE (Self-supervised reward model improvement via Value-Anchored On-policy feedback), a framework that uses a value function to grade on-policy responses and turns them into feedback for on-policy RM training. SAVE converts reward-graded on-policy responses into supervision by using a prompt-specific value head as an adaptive anchor: it computes RM advantages relative to this anchor, filters out ambiguous samples, and updates the RM with a contrastive objective. In experiments on six diverse benchmarks, SAVE achieves superior results on all datasets and delivers consistent improvements across three RL algorithms (GRPO, RLOO, GSPO) and different policy backbones.
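The three steps named above (anchor-relative advantages, ambiguity filtering, contrastive update) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the advantage is taken to be the RM score minus a scalar anchor, ambiguity is taken to mean an absolute advantage within a hypothetical `margin`, and the contrastive objective is stood in for by a Bradley-Terry pairwise loss. The function names are hypothetical, and a real implementation would compute the loss in an autodiff framework so that gradients reach the RM.

```python
import math

def rm_advantages(rewards, value_anchor):
    """Advantage of each response relative to the prompt-specific anchor (assumed form)."""
    return [r - value_anchor for r in rewards]

def filter_ambiguous(advantages, margin):
    """Return (positive, negative) index lists; samples with |advantage| <= margin are dropped."""
    positives = [i for i, a in enumerate(advantages) if a > margin]
    negatives = [i for i, a in enumerate(advantages) if a < -margin]
    return positives, negatives

def contrastive_loss(rewards, positives, negatives):
    """Mean pairwise loss -log(sigmoid(r_pos - r_neg)) over all (positive, negative) pairs."""
    pairs = [(p, n) for p in positives for n in negatives]
    if not pairs:
        return 0.0  # nothing confident enough to learn from for this prompt
    total = 0.0
    for p, n in pairs:
        diff = rewards[p] - rewards[n]
        # Numerically stable softplus(-diff), equal to -log(sigmoid(diff)).
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(pairs)

# Hypothetical RM scores for four on-policy responses to one prompt.
rewards = [1.2, 0.2, -0.8, 0.35]
value_anchor = 0.3   # stand-in for the value head's output on this prompt
margin = 0.2         # hypothetical ambiguity threshold

advantages = rm_advantages(rewards, value_anchor)
positives, negatives = filter_ambiguous(advantages, margin)
loss = contrastive_loss(rewards, positives, negatives)
```

In this toy example only the first response is kept as a positive and the third as a negative; the other two fall inside the margin and are discarded, which is the sense in which the anchor adapts supervision to each prompt.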