The Other Side of RLHF: On-Policy Feedback for Self-Supervised Reward Model Improvement

Abstract

Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse, reliable preference data from human annotators or judge models. The problem worsens sharply as the policy evolves beyond what the static RM was trained on. We therefore propose SAVE (Self-supervised reward model improvement via Value-Anchored On-policy feedback), a framework that uses a value function to grade on-policy responses and feeds those grades back into on-policy RM training. SAVE converts reward-graded on-policy responses into supervision, using a prompt-specific value head as an adaptive anchor: it computes RM advantages, filters out ambiguous samples, and updates the RM with a contrastive objective. In experiments on six diverse benchmarks, SAVE outperforms existing methods on every dataset and yields consistent improvements across three RL algorithms (GRPO, RLOO, GSPO) and different policy backbones.
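The abstract names three steps (anchor by a prompt-specific value, compute RM advantages, filter ambiguous samples before a contrastive update) but gives no formulas. The sketch below is one plausible reading, not the authors' implementation. It assumes the advantage is the reward minus the value estimate, that "ambiguous" means an advantage within a `margin` of zero, and that the contrastive objective is a Bradley-Terry-style pairwise loss. The function names and the `margin` parameter are hypothetical.

```python
import math
from typing import List, Tuple


def rm_advantages(rewards: List[float], value: float) -> List[float]:
    """Advantage of each response relative to the prompt-level anchor.

    Assumption: A_i = r_i - V(prompt); the abstract does not state the formula.
    """
    return [r - value for r in rewards]


def split_by_advantage(
    advantages: List[float], margin: float
) -> Tuple[List[int], List[int]]:
    """Return indices of confident positives and negatives.

    Assumption: samples with |A_i| <= margin are treated as ambiguous and dropped.
    """
    pos = [i for i, a in enumerate(advantages) if a > margin]
    neg = [i for i, a in enumerate(advantages) if a < -margin]
    return pos, neg


def pairwise_contrastive_loss(
    scores: List[float], pos: List[int], neg: List[int]
) -> float:
    """Mean of -log(sigmoid(s_pos - s_neg)) over all (pos, neg) pairs.

    Assumption: a Bradley-Terry-style pairwise loss stands in for the
    contrastive objective. Returns 0.0 when no pair survives filtering.
    """
    if not pos or not neg:
        return 0.0
    total = 0.0
    for i in pos:
        for j in neg:
            diff = scores[i] - scores[j]
            # -log(sigmoid(x)) == log(1 + exp(-x))
            total += math.log1p(math.exp(-diff))
    return total / (len(pos) * len(neg))


# Toy example: four on-policy responses to one prompt (made-up numbers).
rewards = [2.0, 0.1, -1.5, 0.4]
value = 0.3    # hypothetical value-head estimate for this prompt
margin = 0.5   # hypothetical ambiguity threshold

advantages = rm_advantages(rewards, value)
pos, neg = split_by_advantage(advantages, margin)
loss = pairwise_contrastive_loss(rewards, pos, neg)
print(pos, neg, loss)
```

In this toy run only the first response is kept as a positive and only the third as a negative; the two responses scored close to the anchor are discarded. In an actual training loop the scores passed to the loss would be the RM's differentiable outputs, so minimizing the loss widens the gap between the retained positives and negatives.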