Secrets of RLHF in Large Language Models Part II: Reward Modeling
January 11, 2024
Authors: Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
cs.AI
Abstract
Reinforcement Learning from Human Feedback (RLHF) has become a crucial
technology for aligning language models with human values and intentions,
enabling models to produce more helpful and harmless responses. Reward models
are trained as proxies for human preferences to drive reinforcement learning
optimization. While reward models are often considered central to achieving
high performance, they face the following challenges in practical applications:
(1) Incorrect and ambiguous preference pairs in the dataset may hinder the
reward model from accurately capturing human intent. (2) Reward models trained
on data from a specific distribution often struggle to generalize to examples
outside that distribution and are not suitable for iterative RLHF training.
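Reward models of the kind described above are conventionally trained as pairwise rankers. A minimal sketch of the standard Bradley-Terry style objective (this is the common formulation, not necessarily the paper's exact loss):

```python
import math

def pairwise_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected).

    Minimizing this loss pushes the scalar reward assigned to the
    chosen response above the reward of the rejected response.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
```

When the reward gap is zero the loss equals log 2; it shrinks toward zero as the chosen response is scored increasingly above the rejected one. Mislabeled or ambiguous preference pairs, as discussed above, feed this loss contradictory gaps, which is what motivates the data-side analysis in the report.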
In this report, we attempt to address these two issues. (1) From a data
perspective, we propose a method to measure the strength of preferences within
the data, based on a voting mechanism of multiple reward models. Experimental
results confirm that data with varying preference strengths have different
impacts on reward model performance. We introduce a series of novel methods to
mitigate the influence of incorrect and ambiguous preferences in the dataset
and fully leverage high-quality preference data. (2) From an algorithmic
standpoint, we introduce contrastive learning to enhance the ability of reward
models to distinguish between chosen and rejected responses, thereby improving
model generalization. Furthermore, we employ meta-learning to enable the reward
model to maintain the ability to differentiate subtle differences in
out-of-distribution samples, and this approach can be utilized for iterative
RLHF optimization.
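The voting-based preference-strength measurement can be illustrated with a toy ensemble. This is a hedged sketch: the callable interface `rm(prompt, response) -> float` is a hypothetical stand-in for a trained reward model, and the mean/std summary is one natural reading of "a voting mechanism of multiple reward models", not the report's exact procedure:

```python
import statistics

def preference_strength(reward_models, prompt, chosen, rejected):
    """Estimate the strength of a preference pair with an ensemble.

    Each entry of `reward_models` is assumed to be a callable
    rm(prompt, response) -> float. The mean reward gap across the
    ensemble estimates preference strength: a clearly negative mean
    suggests a mislabeled (flipped) pair, a near-zero mean an
    ambiguous one. The standard deviation measures how much the
    ensemble disagrees.
    """
    diffs = [rm(prompt, chosen) - rm(prompt, rejected) for rm in reward_models]
    mean = statistics.fmean(diffs)
    std = statistics.stdev(diffs) if len(diffs) > 1 else 0.0
    return mean, std
```

Pairs with a strongly positive mean gap would be the high-quality data to leverage fully, while negative or near-zero means flag the incorrect and ambiguous preferences whose influence the report seeks to mitigate.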