Accelerated Preference Optimization for Large Language Model Alignment

October 8, 2024
Authors: Jiafan He, Huizhuo Yuan, Quanquan Gu
cs.AI

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences. Direct Preference Optimization (DPO), one of the most popular approaches, formulates RLHF as a policy optimization problem without explicitly estimating the reward function. It overcomes the stability and efficiency issues of two-step approaches, which typically involve first estimating the reward function and then optimizing the policy via proximal policy optimization (PPO). Since RLHF is essentially an optimization problem, and it is well-known that momentum techniques can accelerate optimization both theoretically and empirically, a natural question arises: Can RLHF be accelerated by momentum? This paper answers this question in the affirmative. In detail, we first show that the iterative preference optimization method can be viewed as a proximal point method. Based on this observation, we propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms and employs Nesterov's momentum technique to speed up the alignment of LLMs. Theoretically, we demonstrate that APO can achieve a faster convergence rate than the standard iterative preference optimization methods, including DPO and Self-Play Preference Optimization (SPPO). Empirically, we show the superiority of APO over DPO, iterative DPO, and other strong baselines for RLHF on the AlpacaEval 2.0 benchmark.
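
To make the abstract's optimization argument concrete, the display below is a minimal sketch of the general idea rather than the paper's exact formulation: one round of iterative preference optimization read as a proximal point update over model parameters, followed by the same update with a Nesterov-style extrapolation step. The loss \(\mathcal{L}\), step size \(\eta\), momentum coefficient \(\beta\), and iterates \(\theta_t\) are generic placeholders, not the paper's notation.

% Proximal point view of one iterative preference-optimization round:
% the new iterate minimizes the preference loss plus a proximity term
% that keeps it close to the previous iterate.
\[
  \theta_{t+1} \;=\; \arg\min_{\theta}\;
    \mathcal{L}(\theta) \;+\; \frac{1}{2\eta}\,\lVert \theta - \theta_t \rVert^2
\]

% Nesterov-style acceleration: extrapolate past the current iterate
% before taking the proximal step; \beta \in [0, 1) weights the momentum.
\[
  y_t \;=\; \theta_t + \beta\,(\theta_t - \theta_{t-1}),
  \qquad
  \theta_{t+1} \;=\; \arg\min_{\theta}\;
    \mathcal{L}(\theta) \;+\; \frac{1}{2\eta}\,\lVert \theta - y_t \rVert^2
\]

In this reading, standard iterative methods such as DPO and SPPO correspond to the plain proximal step, while the extrapolated variant illustrates how momentum can yield the faster convergence the paper claims for APO.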
