Secrets of RLHF in Large Language Models Part II: Reward Modeling
January 11, 2024
Authors: Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
cs.AI
Abstract
Reinforcement Learning from Human Feedback (RLHF) has become a crucial
technology for aligning language models with human values and intentions,
enabling models to produce more helpful and harmless responses. Reward models
are trained as proxies for human preferences to drive reinforcement learning
optimization. While reward models are often considered central to achieving
high performance, they face the following challenges in practical applications:
(1) Incorrect and ambiguous preference pairs in the dataset may hinder the
reward model from accurately capturing human intent. (2) Reward models trained
on data from a specific distribution often struggle to generalize to examples
outside that distribution and are not suitable for iterative RLHF training.
In this report, we attempt to address these two issues. (1) From a data
perspective, we propose a method to measure the strength of preferences within
the data, based on a voting mechanism of multiple reward models. Experimental
results confirm that data with varying preference strengths have different
impacts on reward model performance. We introduce a series of novel methods to
mitigate the influence of incorrect and ambiguous preferences in the dataset
and fully leverage high-quality preference data. (2) From an algorithmic
standpoint, we introduce contrastive learning to enhance the ability of reward
models to distinguish between chosen and rejected responses, thereby improving
model generalization. Furthermore, we employ meta-learning to enable the reward
model to maintain the ability to differentiate subtle differences in
out-of-distribution samples, and this approach can be utilized for iterative
RLHF optimization.
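
To make the data-perspective idea concrete, here is a minimal sketch, in plain Python, of one way a "voting mechanism of multiple reward models" could be used to score preference strength. It assumes each ensemble member is exposed as a callable score(prompt, response) -> float; the function names, thresholds, and exact aggregation are hypothetical illustrations, not the paper's reference implementation.

# Hypothetical sketch: estimate the preference strength of a (chosen, rejected)
# pair by polling an ensemble of reward models and aggregating reward margins.
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Callable, List

# A reward model here is any callable mapping (prompt, response) -> float.
RewardFn = Callable[[str, str], float]

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def preference_strength(pair: PreferencePair, reward_models: List[RewardFn]) -> dict:
    """Aggregate reward margins r(chosen) - r(rejected) over the ensemble.

    A large positive mean margin suggests a clear, correct preference; a mean
    near zero suggests an ambiguous pair; a negative mean suggests the label
    may be incorrect (the ensemble "votes" against it).
    """
    margins = [rm(pair.prompt, pair.chosen) - rm(pair.prompt, pair.rejected)
               for rm in reward_models]
    return {
        "mean_margin": mean(margins),
        "std_margin": stdev(margins) if len(margins) > 1 else 0.0,
        "vote_agreement": sum(m > 0 for m in margins) / len(margins),
    }

def partition_dataset(pairs, reward_models, low=0.0, high=1.0):
    """Split pairs into likely-incorrect, ambiguous, and high-quality buckets
    (the thresholds low/high are placeholder values)."""
    flipped, ambiguous, strong = [], [], []
    for pair in pairs:
        s = preference_strength(pair, reward_models)["mean_margin"]
        if s < low:
            flipped.append(pair)      # candidate label errors
        elif s < high:
            ambiguous.append(pair)    # weak or noisy preferences
        else:
            strong.append(pair)       # clear, high-quality preferences
    return flipped, ambiguous, strong

Under this reading, pairs with a negative mean margin could be flipped or dropped, ambiguous pairs down-weighted or smoothed, and high-strength pairs emphasized, in line with the report's stated goal of mitigating incorrect and ambiguous preferences while fully leveraging high-quality preference data.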
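
On the algorithmic side, the sketch below shows what a contrastive-learning-augmented reward-model objective could look like: the usual pairwise ranking loss on reward scores plus an InfoNCE-style term over pooled response representations. The weighting beta, the temperature, and the choice of positives and negatives are assumptions for illustration, not the authors' exact formulation.

# Illustrative sketch (assumed formulation): pairwise ranking loss for the
# reward model plus an InfoNCE-style contrastive regularizer on representations.
import torch
import torch.nn.functional as F

def ranking_loss(r_chosen, r_rejected):
    # Standard pairwise (Bradley-Terry style) reward-model loss:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def contrastive_loss(h_anchor, h_positive, temperature=0.05):
    # InfoNCE over the batch: row i's positive is h_positive[i];
    # every other row in the batch serves as a negative.
    h_anchor = F.normalize(h_anchor, dim=-1)
    h_positive = F.normalize(h_positive, dim=-1)
    logits = h_anchor @ h_positive.T / temperature
    labels = torch.arange(h_anchor.size(0), device=h_anchor.device)
    return F.cross_entropy(logits, labels)

def total_loss(r_chosen, r_rejected, h_anchor, h_positive, beta=0.1):
    # Ranking loss plus a weighted contrastive term (beta is a hypothetical knob).
    return ranking_loss(r_chosen, r_rejected) + beta * contrastive_loss(h_anchor, h_positive)

The meta-learning component described in the abstract would sit on top of such an objective, adapting the reward model so that it keeps separating subtle differences on out-of-distribution samples across iterative RLHF rounds; that outer training loop is not sketched here.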