Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences

Abstract

This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve on itself. The typical approach to post-training LLMs is Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning from the subsequent policy optimization. However, such a reward-maximization approach is limited by the nature of "point-wise" rewards (such as the Bradley-Terry model), which cannot express complex intransitive or cyclic preference relations. While advances in RLHF show that reward learning and policy optimization can be merged into a single contrastive objective for stability, they still remain tethered to the reward-maximization framework. Recently, a new wave of research has sidestepped the reward-maximization presumption in favor of directly optimizing over "pair-wise" or general preferences. In this paper, we introduce Direct Nash Optimization (DNO), a provable and scalable algorithm that marries the simplicity and stability of contrastive learning with the theoretical generality of optimizing general preferences. Because DNO is a batched on-policy algorithm using a regression-based objective, its implementation is straightforward and efficient. Moreover, DNO enjoys monotonic improvement across iterations, which helps it improve even over a strong teacher (such as GPT-4). In our experiments, the resulting 7B-parameter Orca-2.5 model aligned by DNO achieves a state-of-the-art win rate of 33% against GPT-4-Turbo on AlpacaEval 2.0 (even after controlling for response length), an absolute gain of 26% (7% to 33%) over the initializing model. It outperforms models with far more parameters, including Mistral Large, Self-Rewarding LM (70B parameters), and older versions of GPT-4.
