Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Abstract

Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based and follow the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO does not need to estimate the expected win rate of individual responses, an estimate that typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is minimized directly over a preference dataset. We provide a theoretical analysis of our approach and demonstrate its effectiveness through experiments on various representative benchmarks. Starting from a LLaMA-3-8B-based SFT model, INPO achieves a 41.5% length-controlled win rate on AlpacaEval 2.0 and a 38.3% win rate on Arena-Hard, a substantial improvement over the state-of-the-art iterative algorithm [Dong et al., 2024] under the BT model assumption. Additionally, our ablation study highlights the benefits of incorporating KL regularization for response length control.
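To make the game-theoretic framing concrete, the following is a minimal toy sketch, not the INPO algorithm itself. It assumes a hypothetical 3x3 preference matrix `P` over three candidate responses, where `P[i, j]` is the probability that response `i` is preferred to response `j`. The preferences are cyclic, so no BT reward can reproduce them, yet self-play with a standard no-regret update (multiplicative weights, also known as Hedge) yields an average policy close to the Nash policy. The function name `hedge_self_play` and the values of `eta` and `steps` are illustrative assumptions. Note that this toy computes expected win rates explicitly, which is the step INPO is designed to avoid at LLM scale.

```python
import numpy as np

def hedge_self_play(P, steps=2000, eta=0.1):
    """Toy no-regret self-play on a tabular preference game.

    P[i, j] is the probability that response i is preferred to response j,
    so P[i, j] + P[j, i] == 1. Returns the time-averaged policy.
    """
    n = P.shape[0]
    # Start away from the uniform policy so the dynamics are non-trivial.
    policy = np.arange(1, n + 1, dtype=float)
    policy /= policy.sum()
    avg = np.zeros(n)
    for _ in range(steps):
        avg += policy
        # Expected win rate of each response against the current policy.
        win_rate = P @ policy
        # Multiplicative-weights (Hedge) step toward higher win rates.
        policy = policy * np.exp(eta * win_rate)
        policy /= policy.sum()
    return avg / steps

# Hypothetical cyclic preferences: 0 beats 1, 1 beats 2, 2 beats 0.
# A BT model implies transitive preferences, so it cannot fit this matrix.
P = np.array([
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
])

nash = hedge_self_play(P)
# Exploitability: how far the best single response beats the averaged
# policy above the 50% win rate that any policy earns against itself.
exploitability = (P @ nash).max() - 0.5
print(nash, exploitability)
```

In this symmetric game the unique Nash policy is uniform over the three responses, and the averaged self-play policy approaches it even though the individual iterates keep cycling. This is why the time average, rather than the last iterate, is returned in the sketch.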
