Self-Play Preference Optimization for Language Model Alignment

Abstract

Traditional reinforcement learning from human feedback (RLHF) approaches that rely on parametric models such as the Bradley-Terry model fall short in capturing the intransitivity and irrationality of human preferences. Recent advances suggest that working directly with preference probabilities can reflect human preferences more accurately, enabling more flexible and accurate language model alignment. In this paper, we propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game and aims to identify the Nash equilibrium policy. Our approach, dubbed Self-Play Preference Optimization (SPPO), approximates the Nash equilibrium through iterative policy updates and enjoys a theoretical convergence guarantee. Our method can effectively increase the log-likelihood of the chosen response and decrease that of the rejected response, which cannot be trivially achieved by a symmetric pairwise loss such as Direct Preference Optimization (DPO) or Identity Preference Optimization (IPO). In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset, no prompt augmentation, and the pre-trained preference model PairRM with only 0.4B parameters, SPPO fine-tunes Mistral-7B-Instruct-v0.2 into a model that achieves a state-of-the-art length-controlled win rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms (iterative) DPO and IPO on MT-Bench and the Open LLM Leaderboard. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses or preferences) from GPT-4 or other stronger language models.
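To make the game-theoretic framing concrete, the toy sketch below runs self-play on a small intransitive preference matrix and checks that the time-averaged policy is hard to beat. The abstract does not state the update rule, so the multiplicative-weights (exponential-weights) update used here is an assumption, chosen because it is a standard way to approximate equilibria of constant-sum games. The 3x3 matrix `P`, the step size `eta`, and the helper names are hypothetical, and no language model is involved.

```python
import math

# Hypothetical preference matrix: P[i][j] is the probability that response i
# is preferred over response j. It is constant-sum (P[i][j] + P[j][i] == 1)
# and intransitive (0 beats 1, 1 beats 2, 2 beats 0), so no single scalar
# reward per response, as in a Bradley-Terry model, can reproduce it.
P = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]


def win_rates(P, policy):
    """Probability that each response beats a response sampled from `policy`."""
    n = len(policy)
    return [sum(P[i][j] * policy[j] for j in range(n)) for i in range(n)]


def self_play(P, policy, eta=0.1, steps=2000):
    """Assumed multiplicative-weights self-play.

    Returns the final policy and the average of all policies visited.
    """
    avg = [0.0] * len(policy)
    for _ in range(steps):
        # Accumulate the running average before updating the policy.
        avg = [a + p / steps for a, p in zip(avg, policy)]
        # Upweight responses that tend to beat the current policy itself.
        rates = win_rates(P, policy)
        weights = [p * math.exp(eta * r) for p, r in zip(policy, rates)]
        total = sum(weights)
        policy = [w / total for w in weights]
    return policy, avg


last_policy, avg_policy = self_play(P, [0.6, 0.3, 0.1])

# Exploitability: how far above 50% the best single response can score
# against the averaged policy. At an exact Nash equilibrium this is 0.
exploitability = max(win_rates(P, avg_policy)) - 0.5
```

In this symmetric game the unique equilibrium is the uniform policy, so `avg_policy` should end up close to uniform and `exploitability` close to zero. The sketch reads the average rather than the last iterate because plain multiplicative weights can keep cycling on intransitive games.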
