Multiplayer Nash Preference Optimization
September 27, 2025
Authors: Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) has emerged as the standard
paradigm for aligning large language models (LLMs) with human preferences.
However, reward-based methods built on the Bradley-Terry assumption struggle to
capture the non-transitive and heterogeneous nature of real-world preferences.
To address this, recent studies have reframed alignment as a two-player Nash
game, giving rise to Nash learning from human feedback (NLHF). While this
perspective has inspired algorithms such as INPO, ONPO, and EGPO with strong
theoretical and empirical guarantees, they remain fundamentally restricted to
two-player interactions, creating a single-opponent bias that fails to capture
the full complexity of realistic preference structures. In this work, we
introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework
that generalizes NLHF to the multiplayer regime. It formulates alignment as an
n-player game, where each policy competes against a population of opponents
while being regularized toward a reference model. Our framework establishes
well-defined Nash equilibria in multiplayer settings and extends the concept of
duality gap to quantify approximation quality. We demonstrate that MNPO
inherits the equilibrium guarantees of two-player methods while enabling richer
competitive dynamics and improved coverage of diverse preference structures.
Through comprehensive empirical evaluation, we show that MNPO consistently
outperforms existing NLHF baselines on instruction-following benchmarks,
achieving superior alignment quality under heterogeneous annotator conditions
and mixed-policy evaluation scenarios. Together, these results establish MNPO
as a principled and scalable framework for aligning LLMs with complex,
non-transitive human preferences. Code is available at
https://github.com/smiles724/MNPO.
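
To make the abstract's description concrete, the following is a minimal mathematical sketch of how an n-player, KL-regularized preference game of the kind described above could be formalized. The notation (the preference probability P, the regularization weight tau, the prompt distribution rho, and the per-player objective J_i) is assumed from standard NLHF conventions and is not necessarily the paper's exact formulation.

```latex
% Hedged sketch of an n-player NLHF-style objective (assumed notation,
% not the paper's exact definitions). Each player i maximizes its expected
% win rate against the population of opponent policies, while being
% regularized toward a reference policy \pi_{\mathrm{ref}}:
\[
  \max_{\pi_i}\;
  J_i(\pi_i, \pi_{-i})
  \;=\;
  \frac{1}{n-1} \sum_{j \neq i}
  \mathbb{E}_{x \sim \rho,\; y \sim \pi_i(\cdot \mid x),\; y' \sim \pi_j(\cdot \mid x)}
  \bigl[ \mathcal{P}(y \succ y' \mid x) \bigr]
  \;-\; \tau\, \mathrm{KL}\!\bigl(\pi_i \,\|\, \pi_{\mathrm{ref}}\bigr).
\]
% A natural multiplayer analogue of the duality gap measures, for a policy
% profile (\pi_1, \dots, \pi_n), the largest unilateral improvement any
% single player could obtain by deviating:
\[
  \mathrm{Gap}(\pi_1, \dots, \pi_n)
  \;=\;
  \max_{i}\,
  \Bigl[\, \max_{\pi'_i} J_i(\pi'_i, \pi_{-i}) \;-\; J_i(\pi_i, \pi_{-i}) \,\Bigr],
\]
% where \pi_{-i} denotes the other players' policies. The gap is zero at a
% Nash equilibrium, and a small gap certifies an approximate equilibrium.
```

For n = 2 this objective reduces to the familiar regularized two-player NLHF game, which is consistent with the abstract's claim that MNPO inherits the equilibrium guarantees of two-player methods while extending them to richer multiplayer dynamics.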