Secrets of RLHF in Large Language Models Part II: Reward Modeling
January 11, 2024
Authors: Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
cs.AI
Abstract
Reinforcement Learning from Human Feedback (RLHF) has become a crucial
technology for aligning language models with human values and intentions,
enabling models to produce more helpful and harmless responses. Reward models
are trained as proxies for human preferences to drive reinforcement learning
optimization. While reward models are often considered central to achieving
high performance, they face the following challenges in practical applications:
(1) Incorrect and ambiguous preference pairs in the dataset may hinder the
reward model from accurately capturing human intent. (2) Reward models trained
on data from a specific distribution often struggle to generalize to examples
outside that distribution and are not suitable for iterative RLHF training.
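
(For context: a reward model of this kind is typically trained with a pairwise ranking loss over chosen/rejected response pairs. The sketch below is illustrative only, not the authors' code; the function name and tensor shapes are assumptions.)

import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    # chosen_rewards / rejected_rewards: shape (batch,), scalar rewards assigned by
    # the reward model to the human-preferred and dispreferred responses.
    # Minimizing this standard Bradley-Terry style loss pushes r(chosen) above r(rejected).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()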
In this report, we attempt to address these two issues. (1) From a data
perspective, we propose a method to measure the strength of preferences within
the data, based on a voting mechanism of multiple reward models. Experimental
results confirm that data with varying preference strengths have different
impacts on reward model performance. We introduce a series of novel methods to
mitigate the influence of incorrect and ambiguous preferences in the dataset
and fully leverage high-quality preference data. (2) From an algorithmic
standpoint, we introduce contrastive learning to enhance the ability of reward
models to distinguish between chosen and rejected responses, thereby improving
model generalization. Furthermore, we employ meta-learning to enable the reward
model to maintain the ability to differentiate subtle differences in
out-of-distribution samples, and this approach can be utilized for iterative
RLHF optimization.
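
As an illustration of the data-perspective idea, the minimal sketch below estimates a preference-strength score for each pair by "voting" with an ensemble of reward models: the mean reward margin across the ensemble separates likely-incorrect, ambiguous, and clearly-preferred pairs. All names, thresholds, and signatures here are assumptions for illustration, not the paper's implementation.

from statistics import mean, stdev
from typing import Callable, Sequence, Tuple

RewardFn = Callable[[str, str], float]  # (prompt, response) -> scalar reward

def preference_strength(reward_models: Sequence[RewardFn],
                        prompt: str, chosen: str, rejected: str) -> Tuple[float, float]:
    # Reward margin r(chosen) - r(rejected) under each ensemble member.
    margins = [rm(prompt, chosen) - rm(prompt, rejected) for rm in reward_models]
    mu = mean(margins)                                   # mean margin = preference strength
    sigma = stdev(margins) if len(margins) > 1 else 0.0  # disagreement across the ensemble
    return mu, sigma

def partition_pairs(reward_models: Sequence[RewardFn],
                    pairs: Sequence[Tuple[str, str, str]],
                    low: float = 0.0, high: float = 0.5):
    # Bucket (prompt, chosen, rejected) pairs by estimated preference strength:
    # negative mean margin -> likely mislabeled; small margin -> ambiguous;
    # large margin -> high-quality pair. Thresholds are illustrative.
    likely_incorrect, ambiguous, high_quality = [], [], []
    for prompt, chosen, rejected in pairs:
        mu, _ = preference_strength(reward_models, prompt, chosen, rejected)
        bucket = likely_incorrect if mu < low else ambiguous if mu < high else high_quality
        bucket.append((prompt, chosen, rejected))
    return likely_incorrect, ambiguous, high_quality

In such a scheme, pairs in the first two buckets could be down-weighted, relabeled, or smoothed, while the high-quality bucket is fully leveraged for training, matching the abstract's stated goal of mitigating incorrect and ambiguous preferences.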