
Secrets of RLHF in Large Language Models Part II: Reward Modeling

January 11, 2024
Authors: Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
cs.AI

Abstract

Reinforcement Learning from Human Feedback (RLHF) has become a crucial technology for aligning language models with human values and intentions, enabling models to produce more helpful and harmless responses. Reward models are trained as proxies for human preferences to drive reinforcement learning optimization. While reward models are often considered central to achieving high performance, they face the following challenges in practical applications: (1) Incorrect and ambiguous preference pairs in the dataset may hinder the reward model from accurately capturing human intent. (2) Reward models trained on data from a specific distribution often struggle to generalize to examples outside that distribution and are not suitable for iterative RLHF training. In this report, we attempt to address these two issues. (1) From a data perspective, we propose a method to measure the strength of preferences within the data, based on a voting mechanism of multiple reward models. Experimental results confirm that data with varying preference strengths have different impacts on reward model performance. We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset and fully leverage high-quality preference data. (2) From an algorithmic standpoint, we introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses, thereby improving model generalization. Furthermore, we employ meta-learning to enable the reward model to maintain the ability to differentiate subtle differences in out-of-distribution samples, and this approach can be utilized for iterative RLHF optimization.
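The data-side idea — scoring each preference pair with an ensemble of reward models and treating the agreement among their reward margins as a measure of preference strength — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the "reward models" here are hypothetical noisy linear scorers over toy feature vectors, standing in for learned neural scorers r_i(prompt, response).

```python
import random

random.seed(0)

# Toy stand-ins for an ensemble of independently trained reward models.
# In practice each would be a learned scorer r_i(prompt, response) -> float;
# here (purely for illustration) each "model" is a shared linear scorer
# perturbed by Gaussian noise, so the ensemble members disagree slightly.
TRUE_W = [1.0, 1.0, 1.0, 1.0]

def make_toy_reward_model():
    w = [t + random.gauss(0, 0.3) for t in TRUE_W]
    return lambda feats: sum(wi * fi for wi, fi in zip(w, feats))

models = [make_toy_reward_model() for _ in range(10)]

def preference_strength(chosen_feats, rejected_feats):
    """Aggregate the ensemble's reward margins on one preference pair.

    Returns (mean margin, std of margins): the mean estimates the
    preference strength, while a large std flags disagreement among the
    models, i.e. a possibly ambiguous or mislabeled pair.
    """
    diffs = [m(chosen_feats) - m(rejected_feats) for m in models]
    mean = sum(diffs) / len(diffs)
    var = sum((d - mean) ** 2 for d in diffs) / len(diffs)
    return mean, var ** 0.5

# A pair where 'chosen' clearly dominates, versus an exact-tie pair.
clear = preference_strength([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0])
tie = preference_strength([0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5])
print(clear, tie)
```

With a scheme like this, low-strength or high-disagreement pairs can be down-weighted or filtered before reward-model training, which is the spirit of the data-cleaning methods the abstract describes.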