Multiplayer Nash Preference Optimization
September 27, 2025
Authors: Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) has emerged as the standard
paradigm for aligning large language models (LLMs) with human preferences.
However, reward-based methods built on the Bradley-Terry assumption struggle to
capture the non-transitive and heterogeneous nature of real-world preferences.
To address this, recent studies have reframed alignment as a two-player Nash
game, giving rise to Nash learning from human feedback (NLHF). While this
perspective has inspired algorithms such as INPO, ONPO, and EGPO with strong
theoretical and empirical guarantees, they remain fundamentally restricted to
two-player interactions, creating a single-opponent bias that fails to capture
the full complexity of realistic preference structures. In this work, we
introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework
that generalizes NLHF to the multiplayer regime. It formulates alignment as an
n-player game, where each policy competes against a population of opponents
while being regularized toward a reference model. Our framework establishes
well-defined Nash equilibria in multiplayer settings and extends the concept of
duality gap to quantify approximation quality. We demonstrate that MNPO
inherits the equilibrium guarantees of two-player methods while enabling richer
competitive dynamics and improved coverage of diverse preference structures.
Through comprehensive empirical evaluation, we show that MNPO consistently
outperforms existing NLHF baselines on instruction-following benchmarks,
achieving superior alignment quality under heterogeneous annotator conditions
and mixed-policy evaluation scenarios. Together, these results establish MNPO
as a principled and scalable framework for aligning LLMs with complex,
non-transitive human preferences. Code is available at
https://github.com/smiles724/MNPO.
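
To make the abstract's description concrete, the following is a minimal mathematical sketch of how an n-player, KL-regularized preference game of the kind described above could be formalized. The notation (the preference probability P, the regularization weight tau, the prompt distribution rho, and the per-player objective J_i) is assumed from standard NLHF conventions and is not necessarily the paper's exact formulation.

```latex
% Hedged sketch of an n-player NLHF-style objective (assumed notation,
% not the paper's exact definitions). Each player i maximizes its expected
% win rate against the population of opponent policies, while being
% regularized toward a reference policy \pi_{\mathrm{ref}}:
\[
  \max_{\pi_i}\;
  J_i(\pi_i, \pi_{-i})
  \;=\;
  \frac{1}{n-1} \sum_{j \neq i}
  \mathbb{E}_{x \sim \rho,\; y \sim \pi_i(\cdot \mid x),\; y' \sim \pi_j(\cdot \mid x)}
  \bigl[ \mathcal{P}(y \succ y' \mid x) \bigr]
  \;-\; \tau\, \mathrm{KL}\!\bigl(\pi_i \,\|\, \pi_{\mathrm{ref}}\bigr).
\]
% A natural multiplayer analogue of the duality gap measures, for a policy
% profile (\pi_1, \dots, \pi_n), the largest unilateral improvement any
% single player could obtain by deviating:
\[
  \mathrm{Gap}(\pi_1, \dots, \pi_n)
  \;=\;
  \max_{i}\,
  \Bigl[\, \max_{\pi'_i} J_i(\pi'_i, \pi_{-i}) \;-\; J_i(\pi_i, \pi_{-i}) \,\Bigr],
\]
% where \pi_{-i} denotes the other players' policies. The gap is zero at a
% Nash equilibrium, and a small gap certifies an approximate equilibrium.
```

For n = 2 this objective reduces to the familiar regularized two-player NLHF game, which is consistent with the abstract's claim that MNPO inherits the equilibrium guarantees of two-player methods while extending them to richer multiplayer dynamics.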