Multiplayer Nash Preference Optimization

Abstract

Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models (LLMs) with human preferences. However, reward-based methods built on the Bradley-Terry assumption struggle to capture the non-transitive and heterogeneous nature of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). This perspective has inspired algorithms such as INPO, ONPO, and EGPO, which come with strong theoretical and empirical guarantees. These algorithms nevertheless remain fundamentally restricted to two-player interactions, which creates a single-opponent bias and fails to capture the full complexity of realistic preference structures. In this work, we introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. MNPO formulates alignment as an n-player game in which each policy competes against a population of opponents while being regularized toward a reference model. Our framework establishes well-defined Nash equilibria in multiplayer settings and extends the concept of the duality gap to quantify approximation quality. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Through comprehensive empirical evaluation, we show that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at https://github.com/smiles724/MNPO.
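
To make the n-player formulation concrete, the following toy sketch illustrates a policy competing against a population of opponents while being regularized toward a reference model. It is not the implementation from the MNPO repository. The three-response preference matrix `P`, the function names, the coefficient `tau`, and the specific objective form (average win rate against the opponents minus `tau` times the KL divergence to the reference) are all assumptions made for illustration.

```python
import math

# Hypothetical preference matrix over three candidate responses:
# P[a][b] = probability that response a is preferred over response b.
# The cycle 0 > 1 > 2 > 0 is non-transitive, so no single scalar
# reward (as in Bradley-Terry) can rank the three responses consistently.
P = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]


def win_rate(policy, opponent, prefs=P):
    """Probability that a response sampled from `policy` is preferred
    over a response sampled from `opponent`."""
    return sum(
        policy[a] * opponent[b] * prefs[a][b]
        for a in range(len(policy))
        for b in range(len(opponent))
    )


def kl_divergence(policy, reference):
    """KL(policy || reference) for discrete distributions."""
    return sum(p * math.log(p / q) for p, q in zip(policy, reference) if p > 0)


def multiplayer_objective(policy, opponents, reference, tau=0.1, prefs=P):
    """Assumed objective: mean win rate against every opponent in the
    population, penalized by tau * KL(policy || reference)."""
    avg_win = sum(win_rate(policy, o, prefs) for o in opponents) / len(opponents)
    return avg_win - tau * kl_divergence(policy, reference)


uniform = [1 / 3, 1 / 3, 1 / 3]
population = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.2, 0.3, 0.5]]
score = multiplayer_objective(uniform, population, reference=uniform)
```

In this toy game every column of `P` sums to 1.5, so the uniform policy wins with probability of about 0.5 against any opponent and cannot be exploited. A policy that always plays response 0 beats an opponent fixed on response 1 but loses to one fixed on response 2, and it also pays a KL penalty for moving away from the uniform reference.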