The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement
May 29, 2026
Authors: Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li, Qi Liu, Zilong Zheng
cs.AI
Abstract
Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse, reliable preference data from human annotators or judge models. The problem becomes dramatically worse as the policy evolves beyond the static RM's training. We therefore propose SAVE (Self-supervised reward model improvement via Value-Anchored On-policy feedback), a framework that uses a value function to grade on-policy responses and uses those grades as feedback for on-policy RM training. SAVE converts the reward-graded on-policy responses into supervision by using a prompt-specific value head as an adaptive anchor. It computes RM advantages, filters out ambiguous samples, and updates the RM via a contrastive objective. Empirical evaluation on six diverse benchmarks validates SAVE's effectiveness for RM training: it outperforms existing methods on all datasets and yields consistent improvements across three RL algorithms (GRPO, RLOO, GSPO) and different policy backbones.
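The pipeline named in the abstract (anchor, advantage, filter, contrastive update) can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: the advantage is taken to be the RM score minus the prompt-specific anchor, ambiguous samples are those whose absolute advantage falls within a hypothetical `margin`, and the contrastive objective is stood in for by a pairwise logistic (Bradley-Terry style) loss. The function names are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Tuple


def rm_advantages(rewards: List[float], anchor: float) -> List[float]:
    """Assumed form: advantage = RM score minus the prompt-specific anchor value."""
    return [r - anchor for r in rewards]


def split_by_advantage(
    advantages: List[float], margin: float
) -> Tuple[List[int], List[int]]:
    """Drop ambiguous samples (|advantage| <= margin, with margin >= 0) and
    return the indices of the remaining positive and negative samples."""
    pos = [i for i, a in enumerate(advantages) if a > margin]
    neg = [i for i, a in enumerate(advantages) if a < -margin]
    return pos, neg


def pairwise_contrastive_loss(
    scores: List[float], pos: List[int], neg: List[int]
) -> float:
    """Assumed objective: mean of -log(sigmoid(s_pos - s_neg)) over all
    positive/negative pairs. Returns 0.0 when either side is empty."""
    if not pos or not neg:
        return 0.0
    total = 0.0
    for i in pos:
        for j in neg:
            diff = scores[i] - scores[j]
            # -log(sigmoid(d)) written stably for either sign of d.
            total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical RM scores for four on-policy responses to one prompt.
    rewards = [0.9, 0.52, 0.1, 0.48]
    anchor = 0.5   # hypothetical output of the prompt-specific value head
    margin = 0.1   # hypothetical ambiguity threshold

    adv = rm_advantages(rewards, anchor)
    pos, neg = split_by_advantage(adv, margin)
    print("positives:", pos, "negatives:", neg)
    print("loss:", pairwise_contrastive_loss(rewards, pos, neg))
```

In this toy run the two responses scored close to the anchor are filtered out, and only the clearly better and clearly worse responses form a training pair. In an actual training loop the `scores` passed to the loss would come from the RM being updated, with gradients flowing through them; the sketch only evaluates the scalar loss.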