Accelerated Preference Optimization for Large Language Model Alignment

October 8, 2024
Authors: Jiafan He, Huizhuo Yuan, Quanquan Gu
cs.AI

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences. Direct Preference Optimization (DPO), one of the most popular approaches, formulates RLHF as a policy optimization problem without explicitly estimating the reward function. It overcomes the stability and efficiency issues of two-step approaches, which typically involve first estimating the reward function and then optimizing the policy via proximal policy optimization (PPO). Since RLHF is essentially an optimization problem, and it is well-known that momentum techniques can accelerate optimization both theoretically and empirically, a natural question arises: Can RLHF be accelerated by momentum? This paper answers this question in the affirmative. In detail, we first show that the iterative preference optimization method can be viewed as a proximal point method. Based on this observation, we propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms and employs Nesterov's momentum technique to speed up the alignment of LLMs. Theoretically, we demonstrate that APO can achieve a faster convergence rate than the standard iterative preference optimization methods, including DPO and Self-Play Preference Optimization (SPPO). Empirically, we show the superiority of APO over DPO, iterative DPO, and other strong baselines for RLHF on the AlpacaEval 2.0 benchmark.
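
To make the momentum idea concrete, below is a minimal sketch (in PyTorch-style Python, not the authors' code) of how a Nesterov-style extrapolation could wrap an iterative preference-optimization loop such as iterative DPO. The placeholder `run_dpo_step` and the extrapolation coefficient `alpha`, as well as performing the extrapolation directly on model weights, are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch only: Nesterov-style momentum wrapped around iterative preference optimization.
# `run_dpo_step` is a hypothetical stand-in for one round of DPO fine-tuning on fresh
# preference data; `alpha` and the weight-space extrapolation are assumptions.
import copy
import torch
import torch.nn as nn


def run_dpo_step(model: nn.Module) -> nn.Module:
    """Hypothetical placeholder: run one DPO round on newly collected
    preference pairs and return the updated model."""
    return copy.deepcopy(model)  # stand-in; a real step would update the weights


def nesterov_extrapolate(current: nn.Module, previous: nn.Module, alpha: float) -> nn.Module:
    """Extrapolate in parameter space: theta + alpha * (theta - theta_prev)."""
    extrapolated = copy.deepcopy(current)
    with torch.no_grad():
        for p_out, p_cur, p_prev in zip(
            extrapolated.parameters(), current.parameters(), previous.parameters()
        ):
            p_out.copy_(p_cur + alpha * (p_cur - p_prev))
    return extrapolated


def accelerated_preference_optimization(model: nn.Module, rounds: int = 3, alpha: float = 0.3) -> nn.Module:
    """Iterative preference optimization with a momentum (extrapolation) step
    inserted between rounds, in the spirit of the APO framework."""
    prev_model = copy.deepcopy(model)
    for _ in range(rounds):
        updated = run_dpo_step(model)                              # standard iterative round
        model = nesterov_extrapolate(updated, prev_model, alpha)   # momentum step
        prev_model = updated
    return model
```

In this sketch the extrapolation acts directly on model weights; the paper's analysis is stated at the level of the iterative preference-optimization updates, so the coefficient and the space in which extrapolation is applied should be taken from the paper rather than from this illustration.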
