Multiplayer Nash Preference Optimization
September 27, 2025
Authors: Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) has emerged as the standard
paradigm for aligning large language models (LLMs) with human preferences.
However, reward-based methods built on the Bradley-Terry assumption struggle to
capture the non-transitive and heterogeneous nature of real-world preferences.
To address this, recent studies have reframed alignment as a two-player Nash
game, giving rise to Nash learning from human feedback (NLHF). While this
perspective has inspired algorithms such as INPO, ONPO, and EGPO with strong
theoretical and empirical guarantees, they remain fundamentally restricted to
two-player interactions, creating a single-opponent bias that fails to capture
the full complexity of realistic preference structures. In this work, we
introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework
that generalizes NLHF to the multiplayer regime. It formulates alignment as an
n-player game, where each policy competes against a population of opponents
while being regularized toward a reference model. Our framework establishes
well-defined Nash equilibria in multiplayer settings and extends the concept of
duality gap to quantify approximation quality. We demonstrate that MNPO
inherits the equilibrium guarantees of two-player methods while enabling richer
competitive dynamics and improved coverage of diverse preference structures.
Through comprehensive empirical evaluation, we show that MNPO consistently
outperforms existing NLHF baselines on instruction-following benchmarks,
achieving superior alignment quality under heterogeneous annotator conditions
and mixed-policy evaluation scenarios. Together, these results establish MNPO
as a principled and scalable framework for aligning LLMs with complex,
non-transitive human preferences. Code is available at
https://github.com/smiles724/MNPO.
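
To make the formulation described above concrete, the following is a minimal illustrative sketch, not the paper's exact objective, of how an n-player preference game with KL regularization toward a reference policy could be written. The symbols $\pi_i$, $\pi_{\mathrm{ref}}$, the temperature $\tau$, and the general preference model $\mathcal{P}(y \succ y' \mid x)$ are assumptions inferred from the abstract rather than definitions taken from the paper:

$$
\pi_i^{\star} \in \arg\max_{\pi_i}\;
\mathbb{E}_{x \sim \mathcal{D}}\!\left[
\frac{1}{n-1}\sum_{j \neq i}
\mathbb{E}_{y \sim \pi_i(\cdot \mid x),\; y' \sim \pi_j(\cdot \mid x)}
\big[\mathcal{P}(y \succ y' \mid x)\big]
\;-\;
\tau\,\mathrm{KL}\!\big(\pi_i(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)
\right],
\qquad i = 1,\dots,n.
$$

Under this sketch, each policy $\pi_i$ maximizes its average preference win rate against the population of other players while staying close to $\pi_{\mathrm{ref}}$, and a Nash equilibrium is a profile $(\pi_1^{\star},\dots,\pi_n^{\star})$ in which no single policy can improve its expected win rate by deviating unilaterally, the multiplayer analogue of the two-player NLHF equilibrium that the duality gap would quantify.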