Accelerated Preference Optimization for Large Language Model Alignment
October 8, 2024
Authors: Jiafan He, Huizhuo Yuan, Quanquan Gu
cs.AI
Abstract
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal
tool for aligning large language models (LLMs) with human preferences. Direct
Preference Optimization (DPO), one of the most popular approaches, formulates
RLHF as a policy optimization problem without explicitly estimating the reward
function. It overcomes the stability and efficiency issues of two-step
approaches, which typically involve first estimating the reward function and
then optimizing the policy via proximal policy optimization (PPO). Since RLHF
is essentially an optimization problem, and it is well-known that momentum
techniques can accelerate optimization both theoretically and empirically, a
natural question arises: Can RLHF be accelerated by momentum? This paper
answers this question in the affirmative. In detail, we first show that the
iterative preference optimization method can be viewed as a proximal point
method. Based on this observation, we propose a general Accelerated Preference
Optimization (APO) framework, which unifies many existing preference
optimization algorithms and employs Nesterov's momentum technique to speed up
the alignment of LLMs. Theoretically, we demonstrate that APO can achieve a
faster convergence rate than the standard iterative preference optimization
methods, including DPO and Self-Play Preference Optimization (SPPO).
Empirically, we show the superiority of APO over DPO, iterative DPO, and other
strong baselines for RLHF on the AlpacaEval 2.0 benchmark.
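To make the role of momentum concrete, below is a minimal, illustrative sketch of how a Nesterov-style extrapolation step can wrap a base iterative preference optimizer. This is not the paper's exact APO algorithm: the helper `dpo_update`, the momentum coefficient `alpha`, and the choice to extrapolate in parameter space (rather than, say, in log-policy space) are assumptions made purely for illustration.

```python
# Illustrative sketch only: Nesterov-style extrapolation wrapped around an
# iterative preference-optimization loop. `dpo_update` is a hypothetical
# stand-in for one round of a base optimizer such as iterative DPO.

import copy
import torch

def momentum_preference_loop(model, dpo_update, num_iters=3, alpha=0.5):
    """Run iterative preference optimization with an extrapolation step.

    dpo_update(model) is assumed to train the model in place for one round
    of preference optimization; alpha controls the momentum strength.
    """
    # Store the previous (un-extrapolated) iterate of the parameters.
    prev_params = copy.deepcopy(list(model.parameters()))
    for _ in range(num_iters):
        dpo_update(model)  # base preference-optimization round
        with torch.no_grad():
            for p, p_prev in zip(model.parameters(), prev_params):
                # Nesterov-style extrapolation: move past the new iterate
                # in the direction of the most recent change.
                new_p = p + alpha * (p - p_prev)
                p_prev.copy_(p)   # remember the un-extrapolated iterate
                p.copy_(new_p)    # continue from the extrapolated point
    return model
```

In this sketch, setting `alpha=0` recovers plain iterative preference optimization, while a positive `alpha` adds the momentum term that the abstract credits for the faster convergence of APO.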