Accelerated Preference Optimization for Large Language Model Alignment

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences. Direct Preference Optimization (DPO), one of the most popular approaches, formulates RLHF as a policy optimization problem without explicitly estimating the reward function. It overcomes the stability and efficiency issues of two-step approaches, which typically first estimate the reward function and then optimize the policy via proximal policy optimization (PPO). Since RLHF is essentially an optimization problem, and momentum techniques are well known to accelerate optimization both theoretically and empirically, a natural question arises: Can RLHF be accelerated by momentum? This paper answers this question in the affirmative. Specifically, we first show that the iterative preference optimization method can be viewed as a proximal point method. Based on this observation, we propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms and employs Nesterov's momentum technique to speed up the alignment of LLMs. Theoretically, we demonstrate that APO can achieve a faster convergence rate than standard iterative preference optimization methods, including DPO and Self-Play Preference Optimization (SPPO). Empirically, we show the superiority of APO over DPO, iterative DPO, and other strong baselines for RLHF on the AlpacaEval 2.0 benchmark.
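The two ideas named in the abstract, a proximal point view of iterative preference optimization and a momentum-style extrapolation, can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a three-action bandit with a known reward vector, a KL-regularized proximal step whose closed form multiplies the current policy by exp(reward / beta), and an extrapolation applied to the policy logits. The names `run`, `rewards`, `beta`, and `alpha` are hypothetical and chosen only for this illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run(rewards, beta, steps, alpha=0.0):
    """Toy iteration on policy logits (illustrative assumption, not APO itself).

    alpha = 0.0 gives the plain proximal update;
    alpha > 0.0 adds a momentum-style extrapolation in log space.
    """
    n = len(rewards)
    current = [0.0] * n          # logits of the uniform starting policy
    updated = list(current)      # output of the most recent proximal step
    prev_updated = list(current)
    for _ in range(steps):
        # Proximal step: new policy is proportional to
        # current policy * exp(reward / beta).
        updated = [c + r / beta for c, r in zip(current, rewards)]
        # Extrapolation: move further along the direction of the last change.
        current = [u + alpha * (u - p) for u, p in zip(updated, prev_updated)]
        prev_updated = updated
    return softmax(updated)


rewards = [1.0, 0.5, 0.0]
plain = run(rewards, beta=2.0, steps=5, alpha=0.0)
accelerated = run(rewards, beta=2.0, steps=5, alpha=0.3)
print("plain:      ", [round(p, 3) for p in plain])
print("accelerated:", [round(p, 3) for p in accelerated])
```

In this toy setting the extrapolated iterate places more probability on the highest-reward action after the same number of steps, which is the qualitative effect that momentum is meant to provide. It says nothing about the convergence rates or benchmark results reported in the paper.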
