Secrets of RLHF in Large Language Models Part II: Reward Modeling
January 11, 2024
Authors: Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
cs.AI
Abstract
Reinforcement Learning from Human Feedback (RLHF) has become a crucial
technology for aligning language models with human values and intentions,
enabling models to produce more helpful and harmless responses. Reward models
are trained as proxies for human preferences to drive reinforcement learning
optimization. While reward models are often considered central to achieving
high performance, they face the following challenges in practical applications:
(1) Incorrect and ambiguous preference pairs in the dataset may hinder the
reward model from accurately capturing human intent. (2) Reward models trained
on data from a specific distribution often struggle to generalize to examples
outside that distribution and are not suitable for iterative RLHF training.
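Reward models of the kind described above are conventionally trained as pairwise rankers. A minimal sketch of the standard Bradley-Terry style objective (this is the common formulation, not necessarily the paper's exact loss):

```python
import math

def pairwise_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected).

    Minimizing this loss pushes the scalar reward assigned to the
    chosen response above the reward of the rejected response.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
```

When the reward gap is zero the loss equals log 2; it shrinks toward zero as the chosen response is scored increasingly above the rejected one. Mislabeled or ambiguous preference pairs, as discussed above, feed this loss contradictory gaps, which is what motivates the data-side analysis in the report.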
In this report, we attempt to address these two issues. (1) From a data
perspective, we propose a method to measure the strength of preferences within
the data, based on a voting mechanism of multiple reward models. Experimental
results confirm that data with varying preference strengths have different
impacts on reward model performance. We introduce a series of novel methods to
mitigate the influence of incorrect and ambiguous preferences in the dataset
and fully leverage high-quality preference data. (2) From an algorithmic
standpoint, we introduce contrastive learning to enhance the ability of reward
models to distinguish between chosen and rejected responses, thereby improving
model generalization. Furthermore, we employ meta-learning to enable the reward
model to maintain the ability to differentiate subtle differences in
out-of-distribution samples, and this approach can be utilized for iterative
RLHF optimization.
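The voting-based preference-strength measurement can be illustrated with a toy ensemble. This is a hedged sketch: the callable interface `rm(prompt, response) -> float` is a hypothetical stand-in for a trained reward model, and the mean/std summary is one natural reading of "a voting mechanism of multiple reward models", not the report's exact procedure:

```python
import statistics

def preference_strength(reward_models, prompt, chosen, rejected):
    """Estimate the strength of a preference pair with an ensemble.

    Each entry of `reward_models` is assumed to be a callable
    rm(prompt, response) -> float. The mean reward gap across the
    ensemble estimates preference strength: a clearly negative mean
    suggests a mislabeled (flipped) pair, a near-zero mean an
    ambiguous one. The standard deviation measures how much the
    ensemble disagrees.
    """
    diffs = [rm(prompt, chosen) - rm(prompt, rejected) for rm in reward_models]
    mean = statistics.fmean(diffs)
    std = statistics.stdev(diffs) if len(diffs) > 1 else 0.0
    return mean, std
```

Pairs with a strongly positive mean gap would be the high-quality data to leverage fully, while negative or near-zero means flag the incorrect and ambiguous preferences whose influence the report seeks to mitigate.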