Accelerated Preference Optimization for Large Language Model Alignment

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences. Direct Preference Optimization (DPO), one of the most popular approaches, formulates RLHF as a policy optimization problem without explicitly estimating the reward function. It overcomes the stability and efficiency issues of two-step approaches, which typically first estimate the reward function and then optimize the policy via proximal policy optimization (PPO). Since RLHF is essentially an optimization problem, and it is well known that momentum techniques can accelerate optimization both theoretically and empirically, a natural question arises: can RLHF be accelerated by momentum? This paper answers this question in the affirmative. In detail, we first show that the iterative preference optimization method can be viewed as a proximal point method. Based on this observation, we propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms and employs Nesterov's momentum technique to speed up the alignment of LLMs. Theoretically, we demonstrate that APO can achieve a faster convergence rate than the standard iterative preference optimization methods, including DPO and Self-Play Preference Optimization (SPPO). Empirically, we show the superiority of APO over DPO, iterative DPO, and other strong baselines for RLHF on the AlpacaEval 2.0 benchmark.
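
To make the proximal point view and the role of momentum concrete, the following is a minimal sketch on a toy problem. It is not the APO algorithm itself, which operates on language model policies. The quadratic objective, the constant momentum coefficient `alpha`, the step size `eta`, and all function names are assumptions chosen for illustration only. Each iteration extrapolates from the last two iterates (a Nesterov-style step) and then solves the proximal subproblem in closed form; setting `alpha=0.0` recovers the plain proximal point method.

```python
import math


def prox_step(y, a, c, eta):
    """Closed-form solution of argmin_x f(x) + ||x - y||^2 / (2 * eta)
    for the toy objective f(x) = 0.5 * sum_i a[i] * (x[i] - c[i])**2."""
    return [(yi + eta * ai * ci) / (1.0 + eta * ai)
            for yi, ai, ci in zip(y, a, c)]


def run(a, c, x0, eta, alpha, steps):
    """Proximal point iterations with an extrapolation (momentum) step.

    alpha = 0.0 gives the plain proximal point method; alpha > 0 adds a
    constant Nesterov-style extrapolation before each proximal step.
    """
    x_prev, x = list(x0), list(x0)
    for _ in range(steps):
        # Extrapolate along the direction of the most recent update.
        y = [xi + alpha * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        # Proximal step taken from the extrapolated point.
        x_prev, x = x, prox_step(y, a, c, eta)
    return x


def distance(x, c):
    """Euclidean distance between the iterate x and the minimizer c."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))


# Ill-conditioned toy problem (hypothetical values): curvatures span two
# orders of magnitude, so the flattest coordinate dominates the error.
a = [1.0, 0.1, 0.01]
c = [1.0, -2.0, 3.0]
x0 = [0.0, 0.0, 0.0]

err_plain = distance(run(a, c, x0, eta=1.0, alpha=0.0, steps=50), c)
err_momentum = distance(run(a, c, x0, eta=1.0, alpha=0.8, steps=50), c)

print(f"plain proximal point error: {err_plain:.4f}")
print(f"with momentum error:        {err_momentum:.4f}")
```

On this toy problem the extrapolated variant ends closer to the minimizer after the same number of proximal steps. This illustrates the kind of speed-up the abstract refers to, but says nothing about the rates proved in the paper.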
