Multiplayer Nash Preference Optimization

Abstract

Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models (LLMs) with human preferences. However, reward-based methods built on the Bradley-Terry assumption struggle to capture the non-transitive and heterogeneous nature of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO with strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, creating a single-opponent bias that fails to capture the full complexity of realistic preference structures. In this work, we introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an n-player game, where each policy competes against a population of opponents while being regularized toward a reference model. Our framework establishes well-defined Nash equilibria in multiplayer settings and extends the concept of duality gap to quantify approximation quality. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Through comprehensive empirical evaluation, we show that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at https://github.com/smiles724/MNPO.
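To make the terms "non-transitive preference", "population of opponents", and "duality gap" concrete, the following is a minimal toy sketch. It is not the MNPO algorithm and is not taken from the linked repository. The preference matrix `P`, the helper names, and the policies are all hypothetical, and the reference-model regularization mentioned above is omitted. It shows three responses whose pairwise preferences form a cycle, checks that no single ranking (and hence no Bradley-Terry reward assignment) is consistent with them, and measures how exploitable a policy is in the resulting symmetric preference game.

```python
from itertools import permutations

# Hypothetical preference matrix over three candidate responses.
# P[i][j] = probability that response i is preferred over response j.
# The preferences are cyclic: 0 beats 1, 1 beats 2, 2 beats 0.
P = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]

def has_consistent_ranking(pref):
    """True if some strict ordering agrees with every pairwise preference.

    A Bradley-Terry model assigns one scalar reward per response, so it can
    only represent preferences for which such an ordering exists.
    """
    n = len(pref)
    for order in permutations(range(n)):
        if all(pref[order[a]][order[b]] > 0.5
               for a in range(n) for b in range(a + 1, n)):
            return True
    return False

def win_rate(pi, mu):
    """Probability that a response sampled from pi beats one sampled from mu."""
    n = len(P)
    return sum(pi[i] * P[i][j] * mu[j] for i in range(n) for j in range(n))

def population_win_rate(pi, opponents):
    """Average win rate of pi against a list of opponent policies."""
    return sum(win_rate(pi, mu) for mu in opponents) / len(opponents)

def exploitability(pi):
    """Best pure-response win rate against pi, minus the game value 0.5.

    In this symmetric, unregularized game the value is zero exactly at a
    Nash equilibrium, so it plays the role of a simple duality-gap proxy.
    """
    n = len(P)
    best = max(sum(P[i][j] * pi[j] for j in range(n)) for i in range(n))
    return best - 0.5

uniform = [1 / 3, 1 / 3, 1 / 3]   # mixes evenly over the three responses
greedy = [1.0, 0.0, 0.0]          # always plays response 0
opponents = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print("consistent ranking exists:", has_consistent_ranking(P))
print("exploitability(uniform):", exploitability(uniform))
print("exploitability(greedy):", exploitability(greedy))
print("uniform vs population:", population_win_rate(uniform, opponents))
```

In this toy game the uniform policy cannot be exploited, while the deterministic policy loses heavily to the response that beats it. Averaging the win rate over several opponents, as in `population_win_rate`, is one simple way to picture competing against a population rather than a single opponent.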