Nash Learning from Human Feedback

Abstract

Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF begins by learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. The LLM's policy is then fine-tuned with a reinforcement learning algorithm to maximize the reward model. However, current reward models have two inherent limitations: they cannot fully represent the richness of human preferences, and they depend on the sampling distribution. In this study, we introduce an alternative pipeline for fine-tuning LLMs using pairwise human feedback. Our approach first learns a preference model, which is conditioned on two inputs given a prompt, and then seeks a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies whose last iterate converges to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of an LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization, with the potential to advance the field of aligning LLMs with human preferences.
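To make the tabular setting more concrete, the following is a minimal sketch of a mirror-descent-style iteration on a small preference matrix. It is an illustration under stated assumptions, not the Nash-MD algorithm as specified in the paper: the abstract does not give the update rule, so the step used here (a geometric mixture of the current policy with a reference policy `mu`, followed by an exponential-weights update) is assumed, and the names `mirror_descent_step`, `solve`, `eta` (step size), and `tau` (regularization strength) are all hypothetical. The preference matrix is a toy cyclic example with three responses, which no scalar reward can represent.

```python
import math


def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def preference_vs_policy(P, policy):
    """For each response i, the probability that i is preferred over a
    response sampled from `policy`, given P[i][j] = P(i preferred over j)."""
    return [sum(p_ij * q_j for p_ij, q_j in zip(row, policy)) for row in P]


def mirror_descent_step(P, pi, mu, eta, tau):
    """One assumed update: mix pi with the reference mu, then reweight.

    This is a generic KL-regularized exponential-weights step chosen for
    illustration; it is an assumption, not taken from the abstract.
    """
    # Geometric mixture pulls the current policy toward the reference policy.
    mix = normalize(
        [p ** (1 - eta * tau) * m ** (eta * tau) for p, m in zip(pi, mu)]
    )
    # Win rate of each response against the mixture policy.
    win = preference_vs_policy(P, mix)
    # Exponential-weights update in favor of responses that win more often.
    return normalize([q * math.exp(eta * w) for q, w in zip(mix, win)])


def solve(P, mu, eta=0.1, tau=0.5, steps=1000, init=None):
    """Iterate the step above and return the last policy."""
    pi = list(init) if init is not None else list(mu)
    for _ in range(steps):
        pi = mirror_descent_step(P, pi, mu, eta, tau)
    return pi


# Toy cyclic preferences: 0 beats 1, 1 beats 2, 2 beats 0 (each with prob. 0.9).
P = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]
mu = [1 / 3, 1 / 3, 1 / 3]  # hypothetical uniform reference policy

pi = solve(P, mu, init=[0.8, 0.15, 0.05])
print("final policy:", pi)
print("win rates against it:", preference_vs_policy(P, pi))
```

In this toy game the uniform policy is the equilibrium, since no response is preferred over a uniform sample more than half the time. The sketch returns the last policy rather than an average of the iterates, mirroring the last-iterate convergence described above.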
