Accelerating Nash Learning from Human Feedback via Mirror Prox

Abstract

Traditional Reinforcement Learning from Human Feedback (RLHF) often relies on reward models and frequently assumes a preference structure such as the Bradley-Terry model, which may not accurately capture the complexities of real human preferences (e.g., intransitivity). Nash Learning from Human Feedback (NLHF) offers a more direct alternative by framing the problem as finding a Nash equilibrium of a game defined by these preferences. In this work, we introduce Nash Mirror Prox (Nash-MP), an online NLHF algorithm that leverages the Mirror Prox optimization scheme to achieve fast and stable convergence to the Nash equilibrium. Our theoretical analysis establishes that Nash-MP exhibits last-iterate linear convergence towards the $\beta$-regularized Nash equilibrium. Specifically, we prove that the KL-divergence to the optimal policy decreases at a rate of order $(1+2\beta)^{-N/2}$, where $N$ is the number of preference queries. We further demonstrate last-iterate linear convergence for the exploitability gap and uniformly for the span semi-norm of log-probabilities, with all these rates being independent of the size of the action space. Furthermore, we propose and analyze an approximate version of Nash-MP in which proximal steps are estimated using stochastic policy gradients, bringing the algorithm closer to practical applications. Finally, we detail a practical implementation strategy for fine-tuning large language models and present experiments that demonstrate its competitive performance and compatibility with existing methods.
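To make the idea concrete, the following is a minimal sketch of a generic entropic Mirror Prox iteration on a small tabular preference game. It is an illustration under stated assumptions, not the exact Nash-MP update from the paper: the abstract does not give the update rule, so the regularized objective (expected preference minus `beta` times the KL divergence to a reference policy `ref`), the closed-form proximal step, the step size `eta`, and all function names here are assumptions chosen for the example. The preference matrix is a made-up intransitive (rock-paper-scissors-like) example.

```python
import numpy as np


def _softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - logits.max()
    weights = np.exp(shifted)
    return weights / weights.sum()


def prox_step(anchor, payoff, ref, beta, eta):
    """Closed-form maximizer over the simplex of
    <payoff, pi> - beta * KL(pi || ref) - (1 / eta) * KL(pi || anchor).
    (Assumed form of the proximal step for this sketch.)"""
    logits = (payoff + beta * np.log(ref) + np.log(anchor) / eta) / (beta + 1.0 / eta)
    return _softmax(logits)


def mirror_prox(P, ref, beta=0.5, eta=1.0, n_iters=200):
    """Generic Mirror Prox (extragradient-style) self-play on preference matrix P,
    where P[i, j] is the probability that action i is preferred to action j."""
    pi = ref.copy()
    for _ in range(n_iters):
        # Extrapolation step: respond to the current policy.
        half = prox_step(pi, P @ pi, ref, beta, eta)
        # Update step: respond to the extrapolated policy, from the same anchor.
        pi = prox_step(pi, P @ half, ref, beta, eta)
    return pi


def regularized_best_response(P, pi, ref, beta):
    """Maximizer of <P @ pi, q> - beta * KL(q || ref) over the simplex."""
    return _softmax(np.log(ref) + (P @ pi) / beta)


# Hypothetical intransitive preferences: 0 beats 1, 1 beats 2, 2 beats 0.
P = np.array([[0.5, 0.9, 0.1],
              [0.1, 0.5, 0.9],
              [0.9, 0.1, 0.5]])
ref = np.array([0.6, 0.3, 0.1])  # assumed reference policy
beta = 0.5

pi_star = mirror_prox(P, ref, beta=beta)
# At a symmetric regularized equilibrium the policy is its own best response.
residual = np.abs(pi_star - regularized_best_response(P, pi_star, ref, beta)).max()
```

Each iteration in this sketch evaluates the preference operator twice, once at the current policy and once at the extrapolated one. The quantity `residual` measures how far the final policy is from being its own regularized best response, which is a simple fixed-point check for the equilibrium of this toy game.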
