Secrets of RLHF in Large Language Models Part II: Reward Modeling

Abstract

Reinforcement Learning from Human Feedback (RLHF) has become a crucial technology for aligning language models with human values and intentions, enabling models to produce more helpful and harmless responses. Reward models are trained as proxies for human preferences and drive the reinforcement learning optimization. Although reward models are often considered central to achieving high performance, they face two challenges in practical applications: (1) incorrect and ambiguous preference pairs in the dataset may prevent the reward model from accurately capturing human intent; (2) reward models trained on data from a specific distribution often struggle to generalize to examples outside that distribution, which makes them unsuitable for iterative RLHF training. In this report, we attempt to address both issues. (1) From a data perspective, we propose a method for measuring the strength of preferences within the data, based on a voting mechanism among multiple reward models. Experimental results confirm that data with different preference strengths affect reward model performance differently. We introduce a series of novel methods that mitigate the influence of incorrect and ambiguous preferences in the dataset and make full use of high-quality preference data. (2) From an algorithmic perspective, we introduce contrastive learning to strengthen the reward model's ability to distinguish chosen from rejected responses, thereby improving generalization. Furthermore, we employ meta-learning so that the reward model retains the ability to distinguish subtle differences in out-of-distribution samples, and this approach can be used for iterative RLHF optimization.
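
The voting idea in point (1) can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes preference strength is summarized as the mean (with the standard deviation as a disagreement signal) of the reward gap between the chosen and rejected response across an ensemble of reward models. The names `RewardModel`, `preference_strength`, and `label_pair`, and the thresholds `low` and `high`, are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a reward model maps (prompt, response) to a scalar.
RewardModel = Callable[[str, str], float]


def preference_strength(
    models: Sequence[RewardModel], prompt: str, chosen: str, rejected: str
) -> Tuple[float, float]:
    """Return (mean, std) of the reward gap chosen - rejected over the ensemble.

    Assumption: each model "votes" with its own reward gap; a large positive
    mean suggests a clear preference, a mean near zero an ambiguous one, and
    a negative mean a possibly mislabeled pair.
    """
    gaps = [m(prompt, chosen) - m(prompt, rejected) for m in models]
    return mean(gaps), pstdev(gaps)


def label_pair(mean_gap: float, low: float = 0.0, high: float = 1.0) -> str:
    """Bucket a pair by its mean gap; the thresholds are illustrative only."""
    if mean_gap < low:
        return "likely_incorrect"
    if mean_gap < high:
        return "ambiguous"
    return "clear"
```

In practice the ensemble members would be separately trained reward models rather than toy callables, and how each bucket is treated during training (for example filtered, relabeled, or down-weighted) is a design choice this sketch leaves open.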
